
---
title: "/ar:setup — Create New Experiment"
description: "/ar:setup — Create New Experiment - Claude Code skill from the Engineering - POWERFUL domain."
---

# /ar:setup — Create New Experiment

:material-rocket-launch: Engineering - POWERFUL · :material-identifier: `setup` · :material-github: Source

Install: `claude /plugin install engineering-advanced-skills`

Set up a new autoresearch experiment with all required configuration.

## Usage

```text
/ar:setup                                    # Interactive mode
/ar:setup engineering api-speed src/api.py "pytest bench.py" p50_ms lower
/ar:setup --list                             # Show existing experiments
/ar:setup --list-evaluators                  # Show available evaluators
```

## What It Does

### If arguments are provided

Pass them directly to the setup script:

```shell
python {skill_path}/scripts/setup_experiment.py \
  --domain {domain} --name {name} \
  --target {target} --eval "{eval_cmd}" \
  --metric {metric} --direction {direction} \
  [--evaluator {evaluator}] [--scope {scope}]
```

### If no arguments (interactive mode)

Collect each parameter one at a time:

1. **Domain** — Ask: "What domain? (engineering, marketing, content, prompts, custom)"
2. **Name** — Ask: "Experiment name? (e.g., api-speed, blog-titles)"
3. **Target file** — Ask: "Which file to optimize?" Verify that it exists.
4. **Eval command** — Ask: "How to measure it? (e.g., pytest bench.py, python evaluate.py)"
5. **Metric** — Ask: "What metric does the eval output? (e.g., p50_ms, ctr_score)"
6. **Direction** — Ask: "Is lower or higher better?"
7. **Evaluator (optional)** — Show the built-in evaluators. Ask: "Use a built-in evaluator, or your own?"
8. **Scope** — Ask: "Store in project (.autoresearch/) or user (~/.autoresearch/)?"

Then run `setup_experiment.py` with the collected parameters.
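The interactive flow ends in the same script invocation as the argument form. A minimal sketch of assembling that argument list (the `build_setup_command` helper and its dict keys are illustrative, not part of the skill's scripts):

```python
import shlex

# Illustrative only: map interactively collected answers onto the
# setup_experiment.py flags shown above.
def build_setup_command(skill_path, params):
    cmd = [
        "python", f"{skill_path}/scripts/setup_experiment.py",
        "--domain", params["domain"],
        "--name", params["name"],
        "--target", params["target"],
        "--eval", params["eval_cmd"],
        "--metric", params["metric"],
        "--direction", params["direction"],
    ]
    if params.get("evaluator"):          # optional, step 7
        cmd += ["--evaluator", params["evaluator"]]
    if params.get("scope"):              # optional, step 8
        cmd += ["--scope", params["scope"]]
    return cmd

cmd = build_setup_command(".claude/skills/autoresearch-agent", {
    "domain": "engineering", "name": "api-speed",
    "target": "src/api.py", "eval_cmd": "pytest bench.py",
    "metric": "p50_ms", "direction": "lower",
})
print(shlex.join(cmd))
```

Passing the command as a list (rather than one shell string) keeps `pytest bench.py` a single argument and avoids quoting pitfalls.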

## Listing

```shell
# Show existing experiments
python {skill_path}/scripts/setup_experiment.py --list

# Show available evaluators
python {skill_path}/scripts/setup_experiment.py --list-evaluators
```

## Built-in Evaluators

| Name | Metric | Use Case |
|------|--------|----------|
| `benchmark_speed` | `p50_ms` (lower) | Function/API execution time |
| `benchmark_size` | `size_bytes` (lower) | File, bundle, Docker image size |
| `test_pass_rate` | `pass_rate` (higher) | Test suite pass percentage |
| `build_speed` | `build_seconds` (lower) | Build/compile/Docker build time |
| `memory_usage` | `peak_mb` (lower) | Peak memory during execution |
| `llm_judge_content` | `ctr_score` (higher) | Headlines, titles, descriptions |
| `llm_judge_prompt` | `quality_score` (higher) | System prompts, agent instructions |
| `llm_judge_copy` | `engagement_score` (higher) | Social posts, ad copy, emails |
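Each evaluator reports its metric on stdout. A hedged sketch of how a runner might pull a `name: value` pair out of that output (the `name: value` line format here is an assumption for illustration, not the documented output contract):

```python
import re

def extract_metric(stdout: str, metric: str) -> float:
    """Find a 'p50_ms: 41.7'-style line; the format is assumed, not specified."""
    m = re.search(rf"^{re.escape(metric)}\s*[:=]\s*([-+]?[0-9.]+)",
                  stdout, re.MULTILINE)
    if m is None:
        raise ValueError(f"metric {metric!r} not found in eval output")
    return float(m.group(1))

out = "collected 12 items\np50_ms: 41.7\n12 passed\n"
print(extract_metric(out, "p50_ms"))  # 41.7
```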

## After Setup

Report to the user:

- Experiment path and branch name
- Whether the eval command ran successfully, and the baseline metric it produced
- Suggest: "Run /ar:run {domain}/{name} to start iterating, or /ar:loop {domain}/{name} for autonomous mode."
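The baseline metric reported here is what later iterations are judged against. Under the assumed semantics of the direction answer (step 6 above), that comparison reduces to:

```python
def is_improvement(new: float, baseline: float, direction: str) -> bool:
    # "lower" means smaller values win (e.g. p50_ms); "higher" the reverse.
    if direction not in ("lower", "higher"):
        raise ValueError(f"unknown direction: {direction!r}")
    return new < baseline if direction == "lower" else new > baseline

print(is_improvement(38.2, 41.7, "lower"))    # True: latency dropped
print(is_improvement(0.91, 0.95, "higher"))   # False: pass rate fell
```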