feat(autoresearch-agent): fix critical bugs, package as plugin with 5 slash commands

**Bug fixes (run_experiment.py):**
- Fix broken revert logic: was saving HEAD as pre_commit (no-op revert),
  now uses git reset --hard HEAD~1 for correct rollback
- Remove broken --loop mode (agent IS the loop, script handles one iteration)
- Fix shell injection: all git commands use subprocess list form
- Replace shell tail with Python file read

**Bug fixes (other scripts):**
- setup_experiment.py: fix shell injection in git branch creation,
  remove dead --skip-baseline flag, fix evaluator docstring parsing
- log_results.py: fix 6 falsy-zero bugs (baseline=0 treated as None),
  add domain_filter to CSV/markdown export, move import time to top
- evaluators: add FileNotFoundError handling, fix output format mismatch
  in llm_judge_copy, add peak_kb on macOS, add ValueError handling

**Plugin packaging (NEW):**
- plugin.json, settings.json, CLAUDE.md for plugin registry
- 5 slash commands: /ar:setup, /ar:run, /ar:loop, /ar:status, /ar:resume
- /ar:loop supports user-selected intervals (10m, 1h, daily, weekly, monthly)
- experiment-runner agent for autonomous loop iterations
- Registered in marketplace.json as plugin #20

**SKILL.md rewrite:**
- Replace ambiguous "Loop Protocol" with clear "Agent Protocol"
- Add results.tsv format spec, strategy escalation, self-improvement
- Replace "NEVER STOP" with resumable stopping logic

**Docs & sync:**
- Codex (157 skills), Gemini (229 items), and convert.sh all pick up the skill
- 6 new MkDocs pages, mkdocs.yml nav updated
- Counts updated: 17 agents, 22 slash commands

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Reza Rezvani
2026-03-13 14:38:59 +01:00
parent 6dc25df8fa
commit 7911cf957a
39 changed files with 1779 additions and 234 deletions


@@ -0,0 +1,13 @@
{
"name": "autoresearch-agent",
"description": "Autonomous experiment loop that optimizes any file by a measurable metric. 5 slash commands, 8 evaluators, configurable loop intervals (10min to monthly).",
"version": "2.1.2",
"author": {
"name": "Alireza Rezvani",
"url": "https://alirezarezvani.com"
},
"homepage": "https://github.com/alirezarezvani/claude-skills/tree/main/engineering/autoresearch-agent",
"repository": "https://github.com/alirezarezvani/claude-skills",
"license": "MIT",
"skills": "./"
}


@@ -0,0 +1,66 @@
# Autoresearch Agent — Claude Code Instructions
This plugin runs autonomous experiment loops that optimize any file by a measurable metric.
## Commands
Use the `/ar:` namespace for all commands:
- `/ar:setup` — Set up a new experiment interactively
- `/ar:run` — Run a single experiment iteration
- `/ar:loop` — Start an autonomous loop with user-selected interval
- `/ar:status` — Show dashboard and results
- `/ar:resume` — Resume a paused experiment
## How it works
You (the AI agent) are the experiment loop. The scripts handle evaluation and git rollback.
1. You edit the target file with ONE change
2. You commit it
3. You call `run_experiment.py --single` — it evaluates and prints KEEP/DISCARD/CRASH
4. You repeat
Results persist in `results.tsv` and git log. Sessions can be resumed.
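The KEEP/DISCARD decision hinges on the script finding a `metric_name: value` line in the eval output. A minimal sketch of that parsing step (the function name here is illustrative, not the script's actual implementation):

```python
import re

def extract_metric(log_text, metric_name):
    """Return the last 'metric_name: value' match in eval output, or None."""
    pattern = re.compile(
        rf"^{re.escape(metric_name)}:\s*(-?\d+(?:\.\d+)?)\s*$", re.MULTILINE
    )
    matches = pattern.findall(log_text)
    # Take the last occurrence so warm-up prints don't win
    return float(matches[-1]) if matches else None
```

If the evaluator never prints such a line, the iteration is logged as a crash rather than a discard.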
## When to use each command
### Starting fresh
```
/ar:setup
```
Creates the experiment directory, config, program.md, results.tsv, and git branch.
### Running one iteration at a time
```
/ar:run engineering/api-speed
```
Read history, make one change, evaluate, report result.
### Autonomous background loop
```
/ar:loop engineering/api-speed
```
Prompts for interval (10m, 1h, daily, weekly, monthly), then creates a recurring job.
### Checking progress
```
/ar:status
```
Shows the dashboard across all experiments with metrics and trends.
### Resuming after context limit or break
```
/ar:resume engineering/api-speed
```
Reads results history, checks out the branch, and continues where you left off.
## Agents
- **experiment-runner**: Spawned for each loop iteration. Reads config, results history, decides what to try, edits target, commits, evaluates.
## Key principle
**One change per experiment. Measure everything. Compound improvements.**
The agent never modifies the evaluator. The evaluator is ground truth.


@@ -19,6 +19,18 @@ Not one guess — fifty measured attempts, compounding.
---
## Slash Commands
| Command | What it does |
|---------|-------------|
| `/ar:setup` | Set up a new experiment interactively |
| `/ar:run` | Run a single experiment iteration |
| `/ar:loop` | Start autonomous loop with configurable interval (10m, 1h, daily, weekly, monthly) |
| `/ar:status` | Show dashboard and results |
| `/ar:resume` | Resume a paused experiment |
---
## When This Skill Activates
Recognize these patterns from the user:
@@ -82,6 +94,12 @@ The `--scope` flag determines where `.autoresearch/` lives:
└── evaluate.py ← Evaluation script (if --evaluator used)
```
**results.tsv columns:** `commit | metric | status | description`
- `commit` — short git hash
- `metric` — float value or "N/A" for crashes
- `status` — keep | discard | crash
- `description` — what changed or why it crashed
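A sketch of reading `results.tsv` per this spec (the `load_results` helper is illustrative; the shipped scripts have their own loader):

```python
import csv

def load_results(tsv_path):
    """Parse results.tsv: commit, metric, status, description (tab-separated)."""
    rows = []
    with open(tsv_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # "N/A" marks crashes where no metric could be parsed
            raw = row.get("metric", "N/A")
            row["metric"] = None if raw == "N/A" else float(raw)
            rows.append(row)
    return rows
```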
### Domains
| Domain | Use Cases |
@@ -98,48 +116,67 @@ The user may have written their own `program.md`. If found in the experiment dir
---
## The Experiment Loop
## Agent Protocol
You are the loop. The scripts handle setup and evaluation — you handle the creative work.
### Before Starting
1. Read `.autoresearch/{domain}/{name}/config.cfg` to get:
- `target` — the file you edit
- `evaluate_cmd` — the command that measures your changes
- `metric` — the metric name to look for in eval output
- `metric_direction` — "lower" or "higher" is better
- `time_budget_minutes` — max time per evaluation
2. Read `program.md` for strategy, constraints, and what you can/cannot change
3. Read `results.tsv` for experiment history (columns: commit, metric, status, description)
4. Checkout the experiment branch: `git checkout autoresearch/{domain}/{name}`
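A minimal config reader, assuming `config.cfg` uses plain `key = value` lines (the name mirrors the script's `load_config`, but this is only a sketch of that assumed format):

```python
from pathlib import Path

def load_config(path):
    """Parse config.cfg, assumed to be simple 'key = value' lines."""
    config = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        # Skip blanks, comments, and anything that isn't key=value
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        config[key.strip()] = value.strip()
    return config
```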
### Each Iteration
1. Review results.tsv — what worked? What failed? What hasn't been tried?
2. Decide ONE change to the target file. One variable per experiment.
3. Edit the target file
4. Commit: `git add {target} && git commit -m "experiment: {description}"`
5. Evaluate: `python scripts/run_experiment.py --experiment {domain}/{name} --single`
6. Read the output — it prints KEEP, DISCARD, or CRASH with the metric value
7. Go to step 1
### What the Script Handles (you don't)
- Running the eval command with timeout
- Parsing the metric from eval output
- Comparing to previous best
- Reverting the commit on failure (`git reset --hard HEAD~1`)
- Logging the result to results.tsv
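The keep/discard comparison respects `metric_direction` from the config. A sketch of that logic (names illustrative, not the script's exact code):

```python
def is_improvement(value, best, direction):
    """Strict improvement over the current best; an equal metric is a discard."""
    if best is None:
        return True  # first measurement becomes the baseline
    return value < best if direction == "lower" else value > best
```

Note the strict inequality: a change that merely matches the best is reverted, which keeps the branch monotonically improving.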
### Starting an Experiment
```bash
# Run specific experiment
python scripts/run_experiment.py --experiment engineering/api-speed --loop
# Single iteration (test setup)
# Single iteration (the agent calls this repeatedly)
python scripts/run_experiment.py --experiment engineering/api-speed --single
# Resume last active experiment
python scripts/run_experiment.py --resume --loop
# Dry run (show what would happen)
# Dry run (test setup before starting)
python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
```
### The Loop Protocol
### Strategy Escalation
- Runs 1-5: Low-hanging fruit (obvious improvements, simple optimizations)
- Runs 6-15: Systematic exploration (vary one parameter at a time)
- Runs 16-30: Structural changes (algorithm swaps, architecture shifts)
- Runs 30+: Radical experiments (completely different approaches)
- If no improvement in 20+ runs: update program.md Strategy section
```
LOOP FOREVER:
### Self-Improvement
After every 10 experiments, review results.tsv for patterns. Update the
Strategy section of program.md with what you learned (e.g., "caching changes
consistently improve by 5-10%", "refactoring attempts never improve the metric").
Future iterations benefit from this accumulated knowledge.
1. Read program.md for current strategy and constraints
2. Review git log: what has been tried? What worked? What crashed?
3. Review results.tsv: current best metric, trend, recent failures
4. Propose ONE change to the target file
5. Apply the change
6. git commit -m "experiment: [short description of what changed]"
7. Run evaluation: {eval_command} > .autoresearch/{domain}/{name}/run.log 2>&1
8. Parse metric from run.log (grep for metric_name: value)
9. Decision:
- Metric improved → KEEP (advance branch, log "keep")
- Metric equal or worse → REVERT (git reset --hard, log "discard")
- Crash/timeout/parse failure → attempt fix once, else REVERT (log "crash")
10. Append result to results.tsv
11. Go to 1
```
### Stopping
- Run until interrupted by the user, context limit reached, or goal in program.md is met
- Before stopping: ensure results.tsv is up to date
- On context limit: the next session can resume — results.tsv and git log persist
### Rules
- **NEVER STOP.** The human may be asleep. Run until manually interrupted. If you run out of ideas, read papers, re-read the target, try combining previous near-misses, try radical changes.
- **One change per experiment.** Don't change 5 things at once. You won't know what worked.
- **Simplicity criterion.** A small improvement that adds ugly complexity is not worth it. Equal performance with simpler code is a win. Removing code that gets same results is the best outcome.
- **Never modify the evaluator.** `evaluate.py` is the ground truth. Modifying it invalidates all comparisons. Hard stop if you catch yourself doing this.
@@ -258,7 +295,7 @@ cp -r claude-skills/engineering/autoresearch-agent ~/.claude/skills/
### OpenClaw
```bash
clawhub install autoresearch-agent
clawhub install cs-autoresearch-agent
```
---


@@ -0,0 +1,87 @@
# Experiment Runner Agent
You are an autonomous experimenter. Your job is to optimize a target file by a measurable metric, one change at a time.
## Your Role
You are spawned for each iteration of an autoresearch experiment loop. You:
1. Read the experiment state (config, strategy, results history)
2. Decide what to try based on accumulated evidence
3. Make ONE change to the target file
4. Commit and evaluate
5. Report the result
## Process
### 1. Read experiment state
```bash
# Config: what to optimize and how to measure
cat .autoresearch/{domain}/{name}/config.cfg
# Strategy: what you can/cannot change, current approach
cat .autoresearch/{domain}/{name}/program.md
# History: every experiment ever run, with outcomes
cat .autoresearch/{domain}/{name}/results.tsv
# Recent changes: what the code looks like now
git log --oneline -10
git diff HEAD~1 --stat # last change if any
```
### 2. Analyze results history
From results.tsv, identify:
- **What worked** (status=keep): What do these changes have in common?
- **What failed** (status=discard): What approaches should you avoid?
- **What crashed** (status=crash): Are there fragile areas to be careful with?
- **Trends**: Is the metric plateauing? Accelerating? Oscillating?
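One way to sketch this analysis in code, including the 20-run plateau check that triggers strategy escalation (an assumed helper, not part of the shipped scripts):

```python
def summarize(rows, window=20):
    """rows: chronological list of (metric, status) tuples from results.tsv."""
    keeps = sum(1 for _, status in rows if status == "keep")
    crashes = sum(1 for _, status in rows if status == "crash")
    recent = rows[-window:]
    # No keep in the last `window` runs -> the metric has plateaued
    plateaued = len(recent) >= window and not any(s == "keep" for _, s in recent)
    return {"total": len(rows), "keeps": keeps,
            "crashes": crashes, "plateaued": plateaued}
```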
### 3. Select strategy based on experiment count
| Run Count | Strategy | Risk Level |
|-----------|----------|------------|
| 1-5 | Low-hanging fruit: obvious improvements, simple optimizations | Low |
| 6-15 | Systematic exploration: vary one parameter at a time | Medium |
| 16-30 | Structural changes: algorithm swaps, architecture shifts | High |
| 30+ | Radical experiments: completely different approaches | Very High |
If no improvement in the last 20 runs, it's time to update the Strategy section of program.md and try something fundamentally different.
### 4. Make ONE change
- Edit only the target file (from config.cfg)
- Change one variable, one approach, one parameter
- Keep it simple — equal results with simpler code is a win
- No new dependencies
### 5. Commit and evaluate
```bash
git add {target}
git commit -m "experiment: {description}"
python {skill_path}/scripts/run_experiment.py --experiment {domain}/{name} --single
```
### 6. Self-improvement
After every 10th experiment, update program.md's Strategy section:
- Which approaches consistently work? Double down.
- Which approaches consistently fail? Stop trying.
- Any new hypotheses based on the data?
## Hard Rules
- **ONE change per experiment.** Multiple changes = you won't know what worked.
- **NEVER modify the evaluator.** evaluate.py is the ground truth. Modifying it invalidates all comparisons. If you catch yourself doing this, stop immediately.
- **5 consecutive crashes → stop.** Alert the user. Don't burn cycles on a broken setup.
- **Simplicity criterion.** A small improvement that adds ugly complexity is NOT worth it. Removing code that gets same results is the best outcome.
- **No new dependencies.** Only use what's already available.
## Constraints
- Never read or modify files outside the target file and program.md
- Never push to remote — all work stays local
- Never skip the evaluation step — every change must be measured
- Be concise in commit messages — they become the experiment log


@@ -36,7 +36,11 @@ if "DOCKER_IMAGE" in dir() or "DOCKER_IMAGE" in globals():
f"docker image inspect {DOCKER_IMAGE} --format '{{{{.Size}}}}'",
shell=True, capture_output=True, text=True
)
size_bytes = int(result.stdout.strip())
try:
size_bytes = int(result.stdout.strip())
except ValueError:
print(f"Could not parse size from: {result.stdout[:100]}", file=sys.stderr)
sys.exit(1)
elif "TARGET_DIR" in dir() or "TARGET_DIR" in globals():
size_bytes = sum(
os.path.getsize(os.path.join(dp, f))


@@ -43,7 +43,12 @@ ctr_score: <average of all 5 scores>
Be harsh. Most content is mediocre (4-6 range). Only exceptional content scores 8+."""
content = Path(TARGET_FILE).read_text()
try:
content = Path(TARGET_FILE).read_text()
except FileNotFoundError:
print(f"Target file not found: {TARGET_FILE}", file=sys.stderr)
sys.exit(1)
full_prompt = f"{JUDGE_PROMPT}\n\n---\n\nContent to evaluate:\n\n{content}"
# Call the user's CLI tool


@@ -54,6 +54,9 @@ platform_prompt = JUDGE_PROMPTS.get(PLATFORM, JUDGE_PROMPTS["twitter"])
JUDGE_PROMPT = f"""{platform_prompt}
IMPORTANT: You MUST use criterion_1 through criterion_5 as labels, NOT the criterion names.
Do NOT output "hook: 7" — output "criterion_1: 7".
Output EXACTLY this format:
criterion_1: <score>
criterion_2: <score>
@@ -64,7 +67,12 @@ engagement_score: <average of all 5>
Be harsh. Most copy is mediocre (4-6). Only exceptional copy scores 8+."""
content = Path(TARGET_FILE).read_text()
try:
content = Path(TARGET_FILE).read_text()
except FileNotFoundError:
print(f"Target file not found: {TARGET_FILE}", file=sys.stderr)
sys.exit(1)
full_prompt = f"{JUDGE_PROMPT}\n\n---\n\nCopy to evaluate:\n\n{content}"
result = subprocess.run(
@@ -77,12 +85,29 @@ if result.returncode != 0:
sys.exit(1)
output = result.stdout
found_scores = False
for line in output.splitlines():
line = line.strip()
if line.startswith("engagement_score:") or line.startswith("criterion_"):
print(line)
found_scores = True
if "engagement_score:" not in output:
# Fallback: if no criterion_ lines found, try parsing any "word: digit" lines
if not found_scores:
import re
fallback_scores = []
for line in output.splitlines():
line = line.strip()
match = re.match(r'^(\w[\w\s]*?):\s*(\d+(?:\.\d+)?)\s*$', line)
if match and match.group(1).lower() not in ("engagement_score",):
fallback_scores.append(float(match.group(2)))
print(f"criterion_{len(fallback_scores)}: {match.group(2)}")
if fallback_scores:
avg = sum(fallback_scores) / len(fallback_scores)
print(f"engagement_score: {avg:.1f}")
found_scores = True
if "engagement_score:" not in output and not found_scores:
print("Could not parse engagement_score from LLM output", file=sys.stderr)
print(f"Raw: {output[:500]}", file=sys.stderr)
sys.exit(1)


@@ -37,8 +37,17 @@ Score the actual output on these criteria (each 1-10):
Output EXACTLY: quality_score: <average of all 4>
Nothing else."""
prompt = Path(TARGET_FILE).read_text()
test_cases = json.loads(Path(TEST_CASES_FILE).read_text())
try:
prompt = Path(TARGET_FILE).read_text()
except FileNotFoundError:
print(f"Target file not found: {TARGET_FILE}", file=sys.stderr)
sys.exit(1)
try:
test_cases = json.loads(Path(TEST_CASES_FILE).read_text())
except FileNotFoundError:
print(f"Test cases file not found: {TEST_CASES_FILE}", file=sys.stderr)
sys.exit(1)
scores = []
@@ -92,7 +101,7 @@ if not scores:
sys.exit(1)
avg = sum(scores) / len(scores)
quality = avg * 10 # Scale to 0-100
quality = avg * 10 # 1-10 scores → 10-100 range
print(f"quality_score: {quality:.2f}")
print(f"cases_tested: {len(scores)}")


@@ -2,7 +2,6 @@
"""Measure peak memory usage of a command.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
import os
import platform
import subprocess
import sys
@@ -41,8 +40,10 @@ elif system == "Darwin":
if "maximum resident set size" in line.lower():
# macOS reports in bytes
val = int(line.strip().split()[0])
kb = val / 1024
mb = val / (1024 * 1024)
print(f"peak_mb: {mb:.1f}")
print(f"peak_kb: {int(kb)}")
sys.exit(0)
print("Could not parse memory from time output", file=sys.stderr)
sys.exit(1)


@@ -75,8 +75,8 @@ Maximize eval_score on the test suite. Higher is better (0-100).
## Evaluation
- evaluate.py runs the prompt against 20 test cases
- Each test case is scored 0-5 by GPT-4o
- eval_score = average * 20 (maps to 0-100)
- Each test case is scored 1-10 by your CLI tool (Claude, Codex, or Gemini)
- quality_score = average * 10 (maps to 10-100)
- Run log shows which test cases failed
## Stop When
@@ -144,14 +144,14 @@ Maximize pass_rate on the task evaluation suite. Higher is better (0-1).
- Proactive trigger conditions
## What You Cannot Change
- scripts/skill_evaluator.py (fixed evaluation)
- your custom evaluate.py (see Custom Evaluators in SKILL.md)
- Test tasks in tests/ (ground truth benchmark)
- Skill name (used for routing)
- License or metadata
## Evaluation
- skill_evaluator.py runs SKILL.md against 15 standardized tasks
- An AI judge scores each task: 0 (fail), 0.5 (partial), 1 (pass)
- evaluate.py runs SKILL.md against 15 standardized tasks
- Your CLI tool scores each task: 0 (fail), 0.5 (partial), 1 (pass)
- pass_rate = sum(scores) / 15
## Strategy


@@ -18,6 +18,7 @@ import argparse
import csv
import io
import sys
import time
from pathlib import Path
@@ -80,7 +81,7 @@ def compute_stats(results, direction):
best = None
pct_change = None
if baseline and best and baseline != 0:
if baseline is not None and best is not None and baseline != 0:
if direction == "lower":
pct_change = (baseline - best) / baseline * 100
else:
@@ -145,18 +146,17 @@ def print_dashboard(root):
direction = config.get("metric_direction", "lower")
stats = compute_stats(results, direction)
best_str = f"{stats['best']:.4f}" if stats["best"] is not None else ""
pct_str = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else ""
# Determine status
status = "idle"
if stats["total"] > 0:
tsv = exp_dir / "results.tsv"
if tsv.exists():
import time
age_hours = (time.time() - tsv.stat().st_mtime) / 3600
status = "active" if age_hours < 1 else "paused" if age_hours < 24 else "done"
best_str = f"{stats['best']:.4f}" if stats["best"] is not None else ""
pct_str = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else ""
experiments.append({
"domain": domain_dir.name,
"name": exp_dir.name,
@@ -202,7 +202,7 @@ def export_experiment_csv(experiment_dir, experiment_path):
if stats["baseline"] is not None:
writer.writerow(["# Baseline", f"{stats['baseline']:.6f}"])
if stats["best"] is not None:
pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] else ""
pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] is not None else ""
writer.writerow(["# Best", f"{stats['best']:.6f}{pct}"])
writer.writerow(["# Total", stats["total"]])
writer.writerow(["# Keep/Discard/Crash", f"{stats['keeps']}/{stats['discards']}/{stats['crashes']}"])
@@ -216,12 +216,14 @@ def export_experiment_csv(experiment_dir, experiment_path):
return buf.getvalue()
def export_dashboard_csv(root):
def export_dashboard_csv(root, domain_filter=None):
"""Export dashboard as CSV string."""
experiments = []
for domain_dir in sorted(root.iterdir()):
if not domain_dir.is_dir() or domain_dir.name.startswith("."):
continue
if domain_filter and domain_dir.name != domain_filter:
continue
for exp_dir in sorted(domain_dir.iterdir()):
if not exp_dir.is_dir() or not (exp_dir / "config.cfg").exists():
continue
@@ -229,8 +231,8 @@ def export_dashboard_csv(root):
results = load_results(exp_dir)
direction = config.get("metric_direction", "lower")
stats = compute_stats(results, direction)
best_str = f"{stats['best']:.6f}" if stats["best"] else ""
pct_str = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] else ""
best_str = f"{stats['best']:.6f}" if stats["best"] is not None else ""
pct_str = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else ""
experiments.append([
domain_dir.name, exp_dir.name, config.get("metric", ""),
stats["total"], stats["keeps"], stats["discards"], stats["crashes"],
@@ -262,7 +264,7 @@ def export_experiment_markdown(experiment_dir, experiment_path):
lines.append(f"**Experiments:** {stats['total']} total — {stats['keeps']} kept, {stats['discards']} discarded, {stats['crashes']} crashed\n")
if stats["baseline"] is not None and stats["best"] is not None:
pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] else ""
pct = f" ({stats['pct_change']:+.1f}%)" if stats["pct_change"] is not None else ""
lines.append(f"**Progress:** `{stats['baseline']:.6f}` → `{stats['best']:.6f}`{pct}\n")
lines.append(f"| Commit | Metric | Status | Description |")
@@ -275,7 +277,7 @@ def export_experiment_markdown(experiment_dir, experiment_path):
return "\n".join(lines)
def export_dashboard_markdown(root):
def export_dashboard_markdown(root, domain_filter=None):
"""Export dashboard as Markdown string."""
lines = []
lines.append("# Autoresearch Dashboard\n")
@@ -285,6 +287,8 @@ def export_dashboard_markdown(root):
for domain_dir in sorted(root.iterdir()):
if not domain_dir.is_dir() or domain_dir.name.startswith("."):
continue
if domain_filter and domain_dir.name != domain_filter:
continue
for exp_dir in sorted(domain_dir.iterdir()):
if not exp_dir.is_dir() or not (exp_dir / "config.cfg").exists():
continue
@@ -292,10 +296,9 @@ def export_dashboard_markdown(root):
results = load_results(exp_dir)
direction = config.get("metric_direction", "lower")
stats = compute_stats(results, direction)
best = f"`{stats['best']:.4f}`" if stats["best"] else ""
pct = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] else ""
best = f"`{stats['best']:.4f}`" if stats["best"] is not None else ""
pct = f"{stats['pct_change']:+.1f}%" if stats["pct_change"] is not None else ""
import time
tsv = exp_dir / "results.tsv"
status = "idle"
if tsv.exists() and stats["total"] > 0:
@@ -356,7 +359,7 @@ def main():
# For CSV/MD, fall through to dashboard with domain filter
if args.format != "terminal":
# Use dashboard export filtered to domain
output_text = export_dashboard_csv(root) if args.format == "csv" else export_dashboard_markdown(root)
output_text = export_dashboard_csv(root, domain_filter=args.domain) if args.format == "csv" else export_dashboard_markdown(root, domain_filter=args.domain)
else:
return


@@ -2,20 +2,17 @@
"""
autoresearch-agent: Experiment Runner
Executes the autonomous experiment loop for a specific experiment.
Reads config from .autoresearch/{domain}/{name}/config.cfg.
Executes a single experiment iteration. The AI agent is the loop —
it calls this script repeatedly. The script handles evaluation,
metric parsing, keep/discard decisions, and git rollback on failure.
Usage:
python scripts/run_experiment.py --experiment engineering/api-speed --loop
python scripts/run_experiment.py --experiment engineering/api-speed --single
python scripts/run_experiment.py --experiment marketing/medium-ctr --loop
python scripts/run_experiment.py --resume --loop
python scripts/run_experiment.py --experiment engineering/api-speed --dry-run
python scripts/run_experiment.py --experiment engineering/api-speed --single --description "added caching"
"""
import argparse
import os
import signal
import subprocess
import sys
import time
@@ -48,10 +45,11 @@ def load_config(experiment_dir):
return config
def run_cmd(cmd, cwd=None, timeout=None):
"""Run shell command, return (returncode, stdout, stderr)."""
def run_git(args, cwd=None, timeout=30):
"""Run a git command safely (no shell injection). Returns (returncode, stdout, stderr)."""
result = subprocess.run(
cmd, shell=True, capture_output=True, text=True,
["git"] + args,
capture_output=True, text=True,
cwd=cwd, timeout=timeout
)
return result.returncode, result.stdout.strip(), result.stderr.strip()
@@ -59,7 +57,7 @@ def run_cmd(cmd, cwd=None, timeout=None):
def get_current_commit(path):
"""Get short hash of current HEAD."""
_, commit, _ = run_cmd("git rev-parse --short HEAD", cwd=path)
_, commit, _ = run_git(["rev-parse", "--short", "HEAD"], cwd=path)
return commit
@@ -85,17 +83,23 @@ def get_best_metric(experiment_dir, direction):
def run_evaluation(project_root, eval_cmd, time_budget_minutes, log_file):
"""Run evaluation with time limit. Output goes to log_file."""
"""Run evaluation with time limit. Output goes to log_file.
Note: shell=True is intentional here — eval_cmd is user-provided and
may contain pipes, redirects, or chained commands.
"""
hard_limit = time_budget_minutes * 60 * 2.5
t0 = time.time()
try:
code, _, _ = run_cmd(
f"{eval_cmd} > {log_file} 2>&1",
cwd=str(project_root),
timeout=hard_limit
)
with open(log_file, "w") as lf:
result = subprocess.run(
eval_cmd, shell=True,
stdout=lf, stderr=subprocess.STDOUT,
cwd=str(project_root),
timeout=hard_limit
)
elapsed = time.time() - t0
return code, elapsed
return result.returncode, elapsed
except subprocess.TimeoutExpired:
elapsed = time.time() - t0
return -1, elapsed
@@ -141,24 +145,24 @@ def get_experiment_count(experiment_dir):
return max(0, len(tsv.read_text().splitlines()) - 1)
def get_last_active(root):
"""Find the most recently modified experiment."""
latest = None
latest_time = 0
for domain_dir in root.iterdir():
if not domain_dir.is_dir() or domain_dir.name.startswith("."):
continue
for exp_dir in domain_dir.iterdir():
if not exp_dir.is_dir():
continue
cfg = exp_dir / "config.cfg"
if cfg.exists() and cfg.stat().st_mtime > latest_time:
latest_time = cfg.stat().st_mtime
latest = f"{domain_dir.name}/{exp_dir.name}"
return latest
def get_description_from_diff(project_root):
"""Auto-generate a description from git diff --stat HEAD~1."""
code, diff_stat, _ = run_git(["diff", "--stat", "HEAD~1"], cwd=str(project_root))
if code == 0 and diff_stat:
return diff_stat.split("\n")[0][:50]
return "experiment"
def run_single(project_root, experiment_dir, config, exp_num, dry_run=False):
def read_last_lines(filepath, n=5):
"""Read last n lines of a file (replaces tail shell command)."""
path = Path(filepath)
if not path.exists():
return ""
lines = path.read_text().splitlines()
return "\n".join(lines[-n:])
def run_single(project_root, experiment_dir, config, exp_num, dry_run=False, description=None):
"""Run one experiment iteration."""
direction = config.get("metric_direction", "lower")
metric_grep = config.get("metric_grep", "^metric:")
@@ -177,11 +181,9 @@ def run_single(project_root, experiment_dir, config, exp_num, dry_run=False):
print(" [DRY RUN] Would run evaluation and check metric")
return "dry_run"
# Save state for rollback
code, pre_commit, _ = run_cmd("git rev-parse HEAD", cwd=str(project_root))
if code != 0:
print(" Error: can't get git state")
return "error"
# Auto-generate description if not provided
if not description:
description = get_description_from_diff(str(project_root))
# Run evaluation
print(f" Running: {eval_cmd} (budget: {time_budget}m)")
@@ -192,17 +194,17 @@ def run_single(project_root, experiment_dir, config, exp_num, dry_run=False):
# Timeout
if ret_code == -1:
print(f" TIMEOUT after {elapsed:.0f}s — discarding")
run_cmd("git checkout -- .", cwd=str(project_root))
run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
run_git(["checkout", "--", "."], cwd=str(project_root))
run_git(["reset", "--hard", "HEAD~1"], cwd=str(project_root))
log_result(experiment_dir, commit, None, "crash", f"timeout_{elapsed:.0f}s")
return "crash"
# Crash
if ret_code != 0:
_, tail, _ = run_cmd(f"tail -5 {log_file}", cwd=str(project_root))
tail = read_last_lines(log_file, 5)
print(f" CRASH (exit {ret_code}) after {elapsed:.0f}s")
print(f" Last output: {tail[:200]}")
run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
run_git(["reset", "--hard", "HEAD~1"], cwd=str(project_root))
log_result(experiment_dir, commit, None, "crash", f"exit_{ret_code}")
return "crash"
@@ -210,7 +212,7 @@ def run_single(project_root, experiment_dir, config, exp_num, dry_run=False):
metric_val = extract_metric(log_file, metric_grep)
if metric_val is None:
print(f" Could not parse {metric_name} from run.log")
run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
run_git(["reset", "--hard", "HEAD~1"], cwd=str(project_root))
log_result(experiment_dir, commit, None, "crash", "metric_parse_failed")
return "crash"
@@ -224,63 +226,23 @@ def run_single(project_root, experiment_dir, config, exp_num, dry_run=False):
# Keep or discard
if is_improvement(metric_val, best, direction):
print(f" KEEP — improvement")
log_result(experiment_dir, commit, metric_val, "keep",
f"improved_{metric_name}_{metric_val:.4f}")
log_result(experiment_dir, commit, metric_val, "keep", description)
return "keep"
else:
print(f" DISCARD — no improvement")
run_cmd(f"git reset --hard {pre_commit}", cwd=str(project_root))
best_str = f"{best:.4f}" if best else "?"
run_git(["reset", "--hard", "HEAD~1"], cwd=str(project_root))
best_str = f"{best:.4f}" if best is not None else "?"
log_result(experiment_dir, commit, metric_val, "discard",
f"no_improvement_{metric_val:.4f}_vs_{best_str}")
return "discard"
def print_summary(experiment_dir, config):
"""Print session summary."""
tsv = experiment_dir / "results.tsv"
if not tsv.exists():
return
lines = tsv.read_text().splitlines()[1:]
if not lines:
return
keeps = [l for l in lines if "\tkeep\t" in l]
discards = [l for l in lines if "\tdiscard\t" in l]
crashes = [l for l in lines if "\tcrash\t" in l]
metric_name = config.get("metric", "metric")
direction = config.get("metric_direction", "lower")
print(f"\n{'=' * 55}")
print(f" autoresearch — Session Summary")
print(f" Experiments: {len(lines)} total")
print(f" Keep: {len(keeps)} | Discard: {len(discards)} | Crash: {len(crashes)}")
if keeps:
try:
valid = []
for l in keeps:
parts = l.split("\t")
if parts[1] != "N/A":
valid.append(float(parts[1]))
if len(valid) >= 2:
first, last = valid[0], valid[-1]
best = min(valid) if direction == "lower" else max(valid)
pct = ((first - best) / first * 100) if direction == "lower" else ((best - first) / first * 100)
print(f" {metric_name}: {first:.6f} -> {best:.6f} ({pct:+.1f}%)")
except (ValueError, IndexError):
pass
print(f"{'=' * 55}\n")
def main():
parser = argparse.ArgumentParser(description="autoresearch-agent runner")
parser.add_argument("--experiment", help="Experiment path: domain/name (e.g. engineering/api-speed)")
parser.add_argument("--resume", action="store_true", help="Resume last active experiment")
parser.add_argument("--loop", action="store_true", help="Run forever")
parser.add_argument("--single", action="store_true", help="Run one experiment")
parser.add_argument("--single", action="store_true", help="Run one experiment iteration")
parser.add_argument("--dry-run", action="store_true", help="Show what would happen")
parser.add_argument("--max-experiments", type=int, default=0, help="Max experiments (0 = unlimited)")
parser.add_argument("--description", help="Description of the change (auto-generated from git diff if omitted)")
parser.add_argument("--path", default=".", help="Project root")
args = parser.parse_args()
@@ -291,20 +253,11 @@ def main():
print("No .autoresearch/ found. Run setup_experiment.py first.")
sys.exit(1)
# Resolve experiment
experiment_path = args.experiment
if args.resume:
experiment_path = get_last_active(root)
if not experiment_path:
print("No experiments found to resume.")
sys.exit(1)
print(f"Resuming: {experiment_path}")
if not experiment_path:
print("Specify --experiment domain/name or --resume")
if not args.experiment:
print("Specify --experiment domain/name")
sys.exit(1)
experiment_dir = root / experiment_path
experiment_dir = root / args.experiment
if not experiment_dir.exists():
print(f"Experiment not found: {experiment_dir}")
print("Run: python scripts/setup_experiment.py --list")
@@ -312,56 +265,15 @@ def main():
config = load_config(experiment_dir)
domain, name = experiment_path.split("/", 1)
print(f"\n autoresearch-agent")
print(f" Experiment: {experiment_path}")
print(f" Experiment: {args.experiment}")
print(f" Target: {config.get('target', '?')}")
print(f" Metric: {config.get('metric', '?')} ({config.get('metric_direction', '?')} is better)")
print(f" Budget: {config.get('time_budget_minutes', '?')} min/experiment")
print(f" Mode: {'loop' if args.loop else 'single'}")
print(f" Mode: {'dry-run' if args.dry_run else 'single'}")
if args.single or args.dry_run:
exp_num = get_experiment_count(experiment_dir) + 1
run_single(project_root, experiment_dir, config, exp_num, args.dry_run)
return
if not args.loop:
print("\nSpecify --loop (forever) or --single (one experiment)")
sys.exit(1)
# Graceful shutdown
def handle_interrupt(sig, frame):
print_summary(experiment_dir, config)
print("\nStopped by user.")
sys.exit(0)
signal.signal(signal.SIGINT, handle_interrupt)
signal.signal(signal.SIGTERM, handle_interrupt)
consecutive_crashes = 0
exp_num = get_experiment_count(experiment_dir) + 1
print(f"\nStarting loop. Ctrl+C to stop.\n")
while True:
result = run_single(project_root, experiment_dir, config, exp_num, False)
exp_num += 1
if result == "crash":
consecutive_crashes += 1
else:
consecutive_crashes = 0
if consecutive_crashes >= 5:
print("\n 5 consecutive crashes. Pausing.")
print(" Check .autoresearch/{}/run.log".format(experiment_path))
break
if 0 < args.max_experiments < exp_num:
print(f"\n Reached max experiments ({args.max_experiments})")
break
print_summary(experiment_dir, config)
run_single(project_root, experiment_dir, config, exp_num, args.dry_run, args.description)
if __name__ == "__main__":


@@ -19,11 +19,9 @@ Usage:
"""
import argparse
import os
import shutil
import subprocess
import sys
import time
from datetime import datetime
from pathlib import Path
@@ -159,13 +157,19 @@ def copy_evaluator(experiment_dir, evaluator_name):
def create_branch(path, domain, name):
"""Create and checkout the experiment branch."""
branch = f"autoresearch/{domain}/{name}"
code, _, err = run_cmd(f"git checkout -b {branch}", cwd=path)
if code != 0:
if "already exists" in err:
result = subprocess.run(
["git", "checkout", "-b", branch],
cwd=path, capture_output=True, text=True
)
if result.returncode != 0:
if "already exists" in result.stderr:
print(f" Branch '{branch}' already exists. Checking out...")
run_cmd(f"git checkout {branch}", cwd=path)
subprocess.run(
["git", "checkout", branch],
cwd=path, capture_output=True, text=True
)
return branch
print(f" Warning: could not create branch: {err}")
print(f" Warning: could not create branch: {result.stderr}")
return None
print(f" Created branch: {branch}")
return branch
@@ -229,10 +233,17 @@ def list_evaluators():
# Read first docstring line
desc = ""
for line in f.read_text().splitlines():
if line.strip().startswith('"""') or line.strip().startswith("'''"):
stripped = line.strip()
if stripped.startswith('"""') or stripped.startswith("'''"):
quote = stripped[:3]
# Single-line docstring: """Description."""
after_quote = stripped[3:]
if after_quote and after_quote.rstrip(quote[0]).strip():
desc = after_quote.rstrip('"').rstrip("'").strip()
break
continue
if line.strip() and not line.startswith("#!"):
desc = line.strip().strip('"').strip("'")
if stripped and not line.startswith("#!"):
desc = stripped.strip('"').strip("'")
break
print(f" {f.stem:<25} {desc}")
@@ -252,7 +263,6 @@ def main():
help="Where to store experiments: project (./) or user (~/)")
parser.add_argument("--constraints", default="", help="Additional constraints for program.md")
parser.add_argument("--path", default=".", help="Project root path")
parser.add_argument("--skip-baseline", action="store_true", help="Skip baseline run")
parser.add_argument("--skip-branch", action="store_true", help="Don't create git branch")
parser.add_argument("--list", action="store_true", help="List all experiments")
parser.add_argument("--list-evaluators", action="store_true", help="List available evaluators")
@@ -288,7 +298,11 @@ def main():
print(f" Time: {datetime.now().strftime('%Y-%m-%d %H:%M')}\n")
# Check git
code, _, _ = run_cmd("git rev-parse --is-inside-work-tree", cwd=str(project_root))
result = subprocess.run(
["git", "rev-parse", "--is-inside-work-tree"],
cwd=str(project_root), capture_output=True, text=True
)
code = result.returncode
if code != 0:
print(" Error: not a git repository. Run: git init && git add . && git commit -m 'initial'")
sys.exit(1)
@@ -362,7 +376,7 @@ def main():
if not args.skip_branch:
print(f" Branch: autoresearch/{args.domain}/{args.name}")
print(f"\n To start:")
print(f" python scripts/run_experiment.py --experiment {args.domain}/{args.name} --loop")
print(f" python scripts/run_experiment.py --experiment {args.domain}/{args.name} --single")
if __name__ == "__main__":


@@ -0,0 +1,22 @@
{
"name": "autoresearch-agent",
"displayName": "Autoresearch Agent",
"version": "2.1.2",
"description": "Autonomous experiment loop — optimize any file by a measurable metric.",
"author": "Alireza Rezvani",
"license": "MIT",
"platforms": ["claude-code", "openclaw", "codex"],
"category": "engineering",
"tags": ["optimization", "experiments", "benchmarks", "autoresearch", "loop", "metrics"],
"repository": "https://github.com/alirezarezvani/claude-skills",
"commands": {
"setup": "/ar:setup",
"run": "/ar:run",
"loop": "/ar:loop",
"status": "/ar:status",
"resume": "/ar:resume"
},
"agents": [
"experiment-runner"
]
}


@@ -0,0 +1,122 @@
---
name: "loop"
description: "Start an autonomous experiment loop with user-selected interval (10min, 1h, daily, weekly, monthly). Uses CronCreate for scheduling."
command: /ar:loop
---
# /ar:loop — Autonomous Experiment Loop
Start a recurring experiment loop that runs at a user-selected interval.
## Usage
```
/ar:loop engineering/api-speed # Start loop (prompts for interval)
/ar:loop engineering/api-speed 10m # Every 10 minutes
/ar:loop engineering/api-speed 1h # Every hour
/ar:loop engineering/api-speed daily # Daily at ~9am
/ar:loop engineering/api-speed weekly # Weekly on Monday ~9am
/ar:loop engineering/api-speed monthly # Monthly on 1st ~9am
/ar:loop stop engineering/api-speed # Stop an active loop
```
## What It Does
### Step 1: Resolve experiment
If no experiment specified, list experiments and let user pick.
### Step 2: Select interval
If interval not provided as argument, present options:
```
Select loop interval:
1. Every 10 minutes (rapid — stay and watch)
2. Every hour (background — check back later)
3. Daily at ~9am (overnight experiments)
4. Weekly on Monday (long-running experiments)
5. Monthly on 1st (slow experiments)
```
Map to cron expressions:
| Interval | Cron Expression | Shorthand |
|----------|----------------|-----------|
| 10 minutes | `*/10 * * * *` | `10m` |
| 1 hour | `7 * * * *` | `1h` |
| Daily | `57 8 * * *` | `daily` |
| Weekly | `57 8 * * 1` | `weekly` |
| Monthly | `57 8 1 * *` | `monthly` |
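The interval-to-cron mapping above can be sketched as a small lookup (the off-the-hour minutes like `7` and `57 8` mirror the table; presumably chosen to avoid top-of-the-hour load spikes):

```python
# Shorthand -> cron expression, mirroring the table above.
INTERVAL_CRON = {
    "10m":     "*/10 * * * *",
    "1h":      "7 * * * *",      # minute 7 of every hour
    "daily":   "57 8 * * *",     # ~9am every day
    "weekly":  "57 8 * * 1",     # ~9am on Mondays
    "monthly": "57 8 1 * *",     # ~9am on the 1st
}

def cron_for(interval: str) -> str:
    """Resolve a user-supplied shorthand to its cron expression."""
    try:
        return INTERVAL_CRON[interval]
    except KeyError:
        raise ValueError(
            f"Unknown interval {interval!r}; expected one of {sorted(INTERVAL_CRON)}"
        )
```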
### Step 3: Create the recurring job
Use `CronCreate` with this prompt (fill in the experiment details):
```
You are running autoresearch experiment "{domain}/{name}".
1. Read .autoresearch/{domain}/{name}/config.cfg for: target, evaluate_cmd, metric, metric_direction
2. Read .autoresearch/{domain}/{name}/program.md for strategy and constraints
3. Read .autoresearch/{domain}/{name}/results.tsv for experiment history
4. Run: git checkout autoresearch/{domain}/{name}
Then do exactly ONE iteration:
- Review results.tsv: what worked, what failed, what hasn't been tried
- Edit the target file with ONE change (strategy escalation based on run count)
- Commit: git add {target} && git commit -m "experiment: {description}"
- Evaluate: python {skill_path}/scripts/run_experiment.py --experiment {domain}/{name} --single
- Read the output (KEEP/DISCARD/CRASH)
Rules:
- ONE change per experiment
- NEVER modify the evaluator
- If 5 consecutive crashes in results.tsv, delete this cron job (CronDelete) and alert
- After every 10 experiments, update Strategy section of program.md
Current best metric: {read from results.tsv or "no baseline yet"}
Total experiments so far: {count from results.tsv}
```
### Step 4: Store loop metadata
Write to `.autoresearch/{domain}/{name}/loop.json`:
```json
{
"cron_id": "{id from CronCreate}",
"interval": "{user selection}",
"started": "{ISO timestamp}",
"experiment": "{domain}/{name}"
}
```
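Writing that metadata file can be sketched as follows (the `cron_id` value would come back from `CronCreate`; the helper name is illustrative, not part of the scripts):

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def write_loop_metadata(experiment_dir: Path, cron_id: str,
                        interval: str, experiment: str) -> Path:
    """Record the active loop so /ar:status and /ar:loop stop can find it."""
    meta = {
        "cron_id": cron_id,
        "interval": interval,
        "started": datetime.now(timezone.utc).isoformat(),
        "experiment": experiment,
    }
    path = experiment_dir / "loop.json"
    path.write_text(json.dumps(meta, indent=2) + "\n")
    return path
```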
### Step 5: Confirm to user
```
Loop started for {domain}/{name}
Interval: {interval description}
Cron ID: {id}
Auto-expires: 3 days (CronCreate limit)
To check progress: /ar:status
To stop the loop: /ar:loop stop {domain}/{name}
Note: Recurring jobs auto-expire after 3 days.
Run /ar:loop again to restart after expiry.
```
## Stopping a Loop
When user runs `/ar:loop stop {experiment}`:
1. Read `.autoresearch/{domain}/{name}/loop.json` to get the cron ID
2. Call `CronDelete` with that ID
3. Delete `loop.json`
4. Confirm: "Loop stopped for {experiment}. {n} experiments completed."
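The stop sequence above can be sketched like this, with a hypothetical `cron_delete` callable standing in for the `CronDelete` tool call:

```python
import json
from pathlib import Path

def stop_loop(experiment_dir: Path, cron_delete) -> str:
    """Tear down an active loop: read loop.json, delete the cron job, remove the file."""
    meta_path = experiment_dir / "loop.json"
    if not meta_path.exists():
        raise FileNotFoundError(f"No active loop: {meta_path} missing")
    meta = json.loads(meta_path.read_text())
    cron_delete(meta["cron_id"])   # stands in for the CronDelete tool
    meta_path.unlink()             # step 3: remove loop.json
    return meta["cron_id"]
```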
## Important Limitations
- **3-day auto-expiry**: CronCreate jobs expire after 3 days. For longer experiments, the user must re-run `/ar:loop` to restart. Results persist — the new loop picks up where the old one left off.
- **One loop per experiment**: Don't start multiple loops for the same experiment.
- **Concurrent experiments**: Multiple experiments can loop simultaneously ONLY if they're on different git branches (which they are by default — each experiment gets `autoresearch/{domain}/{name}`).


@@ -0,0 +1,77 @@
---
name: "resume"
description: "Resume a paused experiment. Checkout the experiment branch, read results history, continue iterating."
command: /ar:resume
---
# /ar:resume — Resume Experiment
Resume a paused or context-limited experiment. Reads all history and continues where you left off.
## Usage
```
/ar:resume # List experiments, let user pick
/ar:resume engineering/api-speed # Resume specific experiment
```
## What It Does
### Step 1: List experiments if needed
If no experiment specified:
```bash
python {skill_path}/scripts/setup_experiment.py --list
```
Show status for each (active/paused/done based on results.tsv age). Let user pick.
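One way to derive that status from results.tsv age (the 24-hour and 7-day thresholds here are illustrative assumptions, not specified anywhere):

```python
import time
from pathlib import Path

def experiment_status(experiment_dir: Path, now=None) -> str:
    """Classify an experiment by how recently results.tsv was written.

    Thresholds are assumptions: <24h active, <7d paused, older is done.
    """
    tsv = experiment_dir / "results.tsv"
    if not tsv.exists():
        return "new"
    age = (now or time.time()) - tsv.stat().st_mtime
    if age < 24 * 3600:
        return "active"
    if age < 7 * 24 * 3600:
        return "paused"
    return "done"
```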
### Step 2: Load full context
```bash
# Checkout the experiment branch
git checkout autoresearch/{domain}/{name}
# Read config
cat .autoresearch/{domain}/{name}/config.cfg
# Read strategy
cat .autoresearch/{domain}/{name}/program.md
# Read full results history
cat .autoresearch/{domain}/{name}/results.tsv
# Read recent git log for the branch
git log --oneline -20
```
### Step 3: Report current state
Summarize for the user:
```
Resuming: engineering/api-speed
Target: src/api/search.py
Metric: p50_ms (lower is better)
Experiments: 23 total — 8 kept, 12 discarded, 3 crashed
Best: 185ms (-42% from baseline of 320ms)
Last experiment: "added response caching" → KEEP (185ms)
Recent patterns:
- Caching changes: 3 kept, 1 discarded (consistently helpful)
- Algorithm changes: 2 discarded, 1 crashed (high risk, low reward so far)
- I/O optimization: 2 kept (promising direction)
```
### Step 4: Ask next action
```
How would you like to continue?
1. Single iteration (/ar:run) — I'll make one change and evaluate
2. Start a loop (/ar:loop) — Autonomous with scheduled interval
3. Just show me the results — I'll review and decide
```
If the user picks loop, hand off to `/ar:loop` with the experiment pre-selected.
If single, hand off to `/ar:run`.


@@ -0,0 +1,84 @@
---
name: "run"
description: "Run a single experiment iteration. Edit the target file, evaluate, keep or discard."
command: /ar:run
---
# /ar:run — Single Experiment Iteration
Run exactly ONE experiment iteration: review history, decide a change, edit, commit, evaluate.
## Usage
```
/ar:run engineering/api-speed # Run one iteration
/ar:run # List experiments, let user pick
```
## What It Does
### Step 1: Resolve experiment
If no experiment specified, run `python {skill_path}/scripts/setup_experiment.py --list` and ask the user to pick.
### Step 2: Load context
```bash
# Read experiment config
cat .autoresearch/{domain}/{name}/config.cfg
# Read strategy and constraints
cat .autoresearch/{domain}/{name}/program.md
# Read experiment history
cat .autoresearch/{domain}/{name}/results.tsv
# Checkout the experiment branch
git checkout autoresearch/{domain}/{name}
```
### Step 3: Decide what to try
Review results.tsv:
- What changes were kept? What pattern do they share?
- What was discarded? Avoid repeating those approaches.
- What crashed? Understand why.
- How many runs so far? (Escalate strategy accordingly)
**Strategy escalation:**
- Runs 1-5: Low-hanging fruit (obvious improvements)
- Runs 6-15: Systematic exploration (vary one parameter)
- Runs 16-30: Structural changes (algorithm swaps)
- Runs 31+: Radical experiments (completely different approaches)
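The escalation schedule can be written down directly, treating runs 16-30 as structural and anything beyond as radical, per the list above:

```python
def strategy_tier(run_count: int) -> str:
    """Map the number of completed runs to the escalation tier above."""
    if run_count <= 5:
        return "low-hanging fruit"
    if run_count <= 15:
        return "systematic exploration"
    if run_count <= 30:
        return "structural changes"
    return "radical experiments"
```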
### Step 4: Make ONE change
Edit only the target file specified in config.cfg. Change one thing. Keep it simple.
### Step 5: Commit and evaluate
```bash
git add {target}
git commit -m "experiment: {short description of what changed}"
python {skill_path}/scripts/run_experiment.py \
--experiment {domain}/{name} --single
```
### Step 6: Report result
Read the script output. Tell the user:
- **KEEP**: "Improvement! {metric}: {value} ({delta} from previous best)"
- **DISCARD**: "No improvement. {metric}: {value} vs best {best}. Reverted."
- **CRASH**: "Evaluation failed: {reason}. Reverted."
### Step 7: Self-improvement check
After every 10th experiment (check results.tsv line count), update the Strategy section of program.md with patterns learned.
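The every-10th check can be sketched as follows (assumes results.tsv has a single header row, matching how the scripts elsewhere in this commit skip line one):

```python
from pathlib import Path

def due_for_strategy_update(results_tsv: Path) -> bool:
    """True on every 10th completed experiment, counted from results.tsv data rows."""
    if not results_tsv.exists():
        return False
    rows = [l for l in results_tsv.read_text().splitlines()[1:] if l.strip()]
    return len(rows) > 0 and len(rows) % 10 == 0
```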
## Rules
- ONE change per iteration. Don't change 5 things at once.
- NEVER modify the evaluator (evaluate.py). It's ground truth.
- Simplicity wins. Equal performance with simpler code is an improvement.
- No new dependencies.


@@ -0,0 +1,77 @@
---
name: "setup"
description: "Set up a new autoresearch experiment interactively. Collects domain, target file, eval command, metric, direction, and evaluator."
command: /ar:setup
---
# /ar:setup — Create New Experiment
Set up a new autoresearch experiment with all required configuration.
## Usage
```
/ar:setup # Interactive mode
/ar:setup engineering api-speed src/api.py "pytest bench.py" p50_ms lower
/ar:setup --list # Show existing experiments
/ar:setup --list-evaluators # Show available evaluators
```
## What It Does
### If arguments provided
Pass them directly to the setup script:
```bash
python {skill_path}/scripts/setup_experiment.py \
--domain {domain} --name {name} \
--target {target} --eval "{eval_cmd}" \
--metric {metric} --direction {direction} \
[--evaluator {evaluator}] [--scope {scope}]
```
### If no arguments (interactive mode)
Collect each parameter one at a time:
1. **Domain** — Ask: "What domain? (engineering, marketing, content, prompts, custom)"
2. **Name** — Ask: "Experiment name? (e.g., api-speed, blog-titles)"
3. **Target file** — Ask: "Which file to optimize?" Verify it exists.
4. **Eval command** — Ask: "How to measure it? (e.g., pytest bench.py, python evaluate.py)"
5. **Metric** — Ask: "What metric does the eval output? (e.g., p50_ms, ctr_score)"
6. **Direction** — Ask: "Is lower or higher better?"
7. **Evaluator** (optional) — Show built-in evaluators. Ask: "Use a built-in evaluator, or your own?"
8. **Scope** — Ask: "Store in project (.autoresearch/) or user (~/.autoresearch/)?"
Then run `setup_experiment.py` with the collected parameters.
### Listing
```bash
# Show existing experiments
python {skill_path}/scripts/setup_experiment.py --list
# Show available evaluators
python {skill_path}/scripts/setup_experiment.py --list-evaluators
```
## Built-in Evaluators
| Name | Metric | Use Case |
|------|--------|----------|
| `benchmark_speed` | `p50_ms` (lower) | Function/API execution time |
| `benchmark_size` | `size_bytes` (lower) | File, bundle, Docker image size |
| `test_pass_rate` | `pass_rate` (higher) | Test suite pass percentage |
| `build_speed` | `build_seconds` (lower) | Build/compile/Docker build time |
| `memory_usage` | `peak_mb` (lower) | Peak memory during execution |
| `llm_judge_content` | `ctr_score` (higher) | Headlines, titles, descriptions |
| `llm_judge_prompt` | `quality_score` (higher) | System prompts, agent instructions |
| `llm_judge_copy` | `engagement_score` (higher) | Social posts, ad copy, emails |
## After Setup
Report to the user:
- Experiment path and branch name
- Whether the eval command worked and the baseline metric
- Suggest: "Run `/ar:run {domain}/{name}` to start iterating, or `/ar:loop {domain}/{name}` for autonomous mode."


@@ -0,0 +1,71 @@
---
name: "status"
description: "Show experiment dashboard with results, active loops, and progress."
command: /ar:status
---
# /ar:status — Experiment Dashboard
Show experiment results, active loops, and progress across all experiments.
## Usage
```
/ar:status # Full dashboard
/ar:status engineering/api-speed # Single experiment detail
/ar:status --domain engineering # All experiments in a domain
/ar:status --format markdown # Export as markdown
/ar:status --format csv --output results.csv # Export as CSV
```
## What It Does
### Single experiment
```bash
python {skill_path}/scripts/log_results.py --experiment {domain}/{name}
```
Also check for active loop:
```bash
cat .autoresearch/{domain}/{name}/loop.json 2>/dev/null
```
If loop.json exists, show:
```
Active loop: every {interval} (cron ID: {id}, started: {date})
```
### Domain view
```bash
python {skill_path}/scripts/log_results.py --domain {domain}
```
### Full dashboard
```bash
python {skill_path}/scripts/log_results.py --dashboard
```
For each experiment, also check for loop.json and show loop status.
### Export
```bash
# CSV
python {skill_path}/scripts/log_results.py --dashboard --format csv --output {file}
# Markdown
python {skill_path}/scripts/log_results.py --dashboard --format markdown --output {file}
```
## Output Example
```
DOMAIN EXPERIMENT RUNS KEPT BEST CHANGE STATUS LOOP
engineering api-speed 47 14 185ms -76.9% active every 1h
engineering bundle-size 23 8 412KB -58.3% paused —
marketing medium-ctr 31 11 8.4/10 +68.0% active daily
prompts support-tone 15 6 82/100 +46.4% done —
```