Reza Rezvani | 7911cf957a
feat(autoresearch-agent): fix critical bugs, package as plugin with 5 slash commands
**Bug fixes (run_experiment.py):**
- Fix broken revert logic: it saved HEAD as pre_commit, making the later
reset a no-op; now uses git reset --hard HEAD~1 for a correct rollback
- Remove broken --loop mode (agent IS the loop, script handles one iteration)
- Fix shell injection: all git commands now use subprocess list form
- Replace the shell tail call with a plain Python file read (both fixes
sketched below)
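A minimal sketch of the corrected logic, assuming stdlib-only code
(function names and the log path are illustrative, not the script's
actual identifiers):

```python
import subprocess

def revert_last_commit():
    # List form (no shell) avoids injection; --hard HEAD~1 actually drops
    # the failed experiment commit. Saving HEAD as pre_commit made the
    # later reset a no-op, since HEAD was already there.
    subprocess.run(["git", "reset", "--hard", "HEAD~1"], check=True)

def tail(path, n=20):
    # Pure-Python replacement for shelling out to `tail -n`.
    with open(path, encoding="utf-8") as f:
        return f.readlines()[-n:]
```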
**Bug fixes (other scripts):**
- setup_experiment.py: fix shell injection in git branch creation,
remove dead --skip-baseline flag, fix evaluator docstring parsing
- log_results.py: fix 6 falsy-zero bugs (a baseline of 0 was treated as
None; see the sketch below), add domain_filter to CSV/markdown export,
move the import time statement to the top of the module
- evaluators: add FileNotFoundError handling, fix output format mismatch
in llm_judge_copy, add peak_kb on macOS, add ValueError handling
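The falsy-zero class of bug, in a self-contained sketch (names here are
illustrative; the real code lives in log_results.py):

```python
def resolve_baseline(baseline, compute_baseline):
    # Buggy form: `if not baseline:` also fires on a legitimate measured
    # value of 0 or 0.0, silently recomputing the baseline.
    # Fixed form: only None means "no baseline recorded yet".
    if baseline is None:
        baseline = compute_baseline()
    return baseline

assert resolve_baseline(0, lambda: 42) == 0      # 0 is real data, kept
assert resolve_baseline(None, lambda: 42) == 42  # missing, recomputed
```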
**Plugin packaging (NEW):**
- plugin.json, settings.json, CLAUDE.md for plugin registry
- 5 slash commands: /ar:setup, /ar:run, /ar:loop, /ar:status, /ar:resume
- /ar:loop supports user-selected intervals (10m, 1h, daily, weekly, monthly)
- experiment-runner agent for autonomous loop iterations
- Registered in marketplace.json as plugin #20
**SKILL.md rewrite:**
- Replace ambiguous "Loop Protocol" with clear "Agent Protocol"
- Add results.tsv format spec, strategy escalation, self-improvement
- Replace "NEVER STOP" with resumable stopping logic
**Docs & sync:**
- Codex (157 skills), Gemini (229 items), and convert.sh all pick up the skill
- 6 new MkDocs pages, mkdocs.yml nav updated
- Counts updated: 17 agents, 22 slash commands
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-13 14:38:59 +01:00
Leo | 12591282da
refactor: autoresearch-agent v2.0 — multi-experiment, multi-domain, real-world evaluators
Major rewrite based on deep study of Karpathy's autoresearch repo.
Architecture changes:
- Multi-experiment support: .autoresearch/{domain}/{name}/ structure
(layout sketched below)
- Domain categories: engineering, marketing, content, prompts, custom
- Project-level (git-tracked, shareable) or user-level (~/.autoresearch/) scope
- User chooses scope during setup, not at installation
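For a hypothetical project-level experiment named faster-parser, the
layout looks like this (the files inside the experiment directory are
illustrative; program.md and results.tsv come from the skill's templates
and tracking format):

```
.autoresearch/
└── engineering/
    └── faster-parser/
        ├── program.md    # experiment brief, from the reference templates
        └── results.tsv   # one row per loop iteration
```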
New evaluators (8 ready-to-use):
- Free: benchmark_speed, benchmark_size, test_pass_rate, build_speed, memory_usage
- LLM judge (uses existing subscription): llm_judge_content, llm_judge_prompt, llm_judge_copy
- LLM judges call user's CLI tool (claude/codex/gemini) — no extra API keys needed
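A minimal sketch of that pattern (the -p print-mode flag matches Claude's
CLI; other tools may spell it differently, and the real evaluators handle
prompting and score parsing):

```python
import subprocess

def llm_judge(prompt, tool="claude"):
    # Shell out to the user's own CLI subscription; no API key required.
    result = subprocess.run(
        [tool, "-p", prompt],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()

# e.g. float(llm_judge("Rate this ad copy 0-10. Reply with only a number:\n" + text))
```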
Script improvements:
- setup_experiment.py: --domain, --scope, --evaluator, --list, --list-evaluators
- run_experiment.py: --experiment domain/name, --resume, --loop, --single
- log_results.py: --dashboard, --domain, --format csv|markdown|terminal, --output
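Illustrative invocations using those flags (the experiment name and output
path are made up):

```
python setup_experiment.py --domain engineering --scope project --evaluator benchmark_speed
python run_experiment.py --experiment engineering/faster-parser --single
python log_results.py --dashboard --format markdown --output dashboard.md
```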
Results export:
- Terminal (default), CSV, and Markdown formats
- Per-experiment, per-domain, or cross-experiment dashboard view
SKILL.md rewritten:
- Clear activation triggers (when the skill should activate)
- Practical examples for each domain
- Evaluator documentation with cost transparency
- Simplified loop protocol matching Karpathy's original philosophy
2026-03-13 08:22:29 +01:00
Leo | a799d8bdb8
feat: add autoresearch-agent — autonomous experiment loop for ML, prompt, code & skill optimization
Inspired by Karpathy's autoresearch. The agent modifies a target file, runs a
fixed evaluation, keeps improvements (git commit), discards failures (git reset),
and loops indefinitely — no human in the loop.
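The heart of that loop, as a Python sketch (run_experiment.py is the real
implementation; the evaluation here is an injected callable, and higher
scores are assumed better, so invert the comparison for latency-style
metrics like p50_ms):

```python
import subprocess

def run_iteration(evaluate, best_score):
    # The agent has already modified the target file for this iteration.
    score = evaluate()  # the fixed evaluation, e.g. val_bpb or pass_rate
    if score > best_score:
        # Keep the improvement as a commit.
        subprocess.run(["git", "add", "-A"], check=True)
        subprocess.run(["git", "commit", "-m", "experiment: improvement"], check=True)
        return score
    # Discard the failed modification from the working tree.
    subprocess.run(["git", "reset", "--hard", "HEAD"], check=True)
    return best_score
```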
Includes:
- SKILL.md with setup wizard, 4 domain configs, experiment loop protocol
- 3 stdlib-only Python scripts (setup, run, log — 687 lines)
- Reference docs: experiment domains guide, program.md templates
Domains: ML training (val_bpb), prompt engineering (eval_score),
code performance (p50_ms), agent skill optimization (pass_rate).
Cherry-picked from feat/autoresearch-agent and rebased onto dev.
Fixes: timeout inconsistency (2x→2.5x), results.tsv tracking clarity,
zero-metric edge case, installation section aligned with multi-tool support.
2026-03-13 07:21:44 +01:00