refactor: autoresearch-agent v2.0 — multi-experiment, multi-domain, real-world evaluators
Major rewrite based on deep study of Karpathy's autoresearch repo.
Architecture changes:
- Multi-experiment support: .autoresearch/{domain}/{name}/ structure
- Domain categories: engineering, marketing, content, prompts, custom
- Project-level (git-tracked, shareable) or user-level (~/.autoresearch/) scope
- User chooses the scope during setup, not at install time
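For illustration, the new structure and scope options map to paths like the following (experiment names are hypothetical):

    .autoresearch/engineering/faster-inference/    # project scope, git-tracked
    .autoresearch/marketing/launch-email/          # project scope, git-tracked
    ~/.autoresearch/prompts/support-triage/        # user scope, per-user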
New evaluators (8 ready-to-use):
- Free: benchmark_speed, benchmark_size, test_pass_rate, build_speed, memory_usage
- LLM judge (uses existing subscription): llm_judge_content, llm_judge_prompt, llm_judge_copy
- LLM judges call user's CLI tool (claude/codex/gemini) — no extra API keys needed
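As a rough sketch of the judge mechanism (not the exact command the evaluators emit; flags and prompt wording vary by tool, Claude Code's non-interactive -p mode is shown), a judge boils down to piping the artifact and a rubric into the user's CLI and parsing the score:

    cat draft_copy.md | claude -p "Score this marketing copy 1-10 for clarity and persuasiveness; reply with only the number"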
Script improvements:
- setup_experiment.py: --domain, --scope, --evaluator, --list, --list-evaluators
- run_experiment.py: --experiment domain/name, --resume, --loop, --single
- log_results.py: --dashboard, --domain, --format csv|markdown|terminal, --output
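Typical invocations with the new flags (script paths, experiment names, and flag values are illustrative):

    # create a project-scoped experiment with a free evaluator
    python setup_experiment.py --domain engineering --scope project --evaluator benchmark_speed

    # run it in a loop, or resume an interrupted run
    python run_experiment.py --experiment engineering/faster-inference --loop
    python run_experiment.py --experiment engineering/faster-inference --resume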
Results export:
- Terminal (default), CSV, and Markdown formats
- Per-experiment and per-domain views, plus a cross-experiment dashboard
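For example (output filenames are placeholders):

    # cross-experiment dashboard in the terminal
    python log_results.py --dashboard

    # per-domain results exported to CSV or Markdown
    python log_results.py --domain marketing --format csv --output marketing_results.csv
    python log_results.py --domain marketing --format markdown --output marketing_results.md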
SKILL.md rewritten:
- Clear activation triggers describing when the skill applies
- Practical examples for each domain
- Evaluator documentation with cost transparency
- Simplified loop protocol matching Karpathy's original philosophy