refactor: autoresearch-agent v2.0 — multi-experiment, multi-domain, real-world evaluators
Major rewrite based on deep study of Karpathy's autoresearch repo.
Architecture changes:
- Multi-experiment support: .autoresearch/{domain}/{name}/ structure (example layout below)
- Domain categories: engineering, marketing, content, prompts, custom
- Project-level (git-tracked, shareable) or user-level (~/.autoresearch/) scope
- User chooses scope during setup, not at install time
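For example, an engineering experiment (the name "prompt-caching" here is purely illustrative) could live at either scope:

  .autoresearch/engineering/prompt-caching/     # project-level, git-tracked and shareable
  ~/.autoresearch/engineering/prompt-caching/   # user-level, private to the machine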
New evaluators (8 ready-to-use):
- Free: benchmark_speed, benchmark_size, test_pass_rate, build_speed, memory_usage
- LLM judge (uses existing subscription): llm_judge_content, llm_judge_prompt, llm_judge_copy
- LLM judges call user's CLI tool (claude/codex/gemini) — no extra API keys needed
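As a minimal sketch of how an LLM-judge evaluator can lean on the local CLI — the invocation, artifact path, rubric, and output format below are illustrative assumptions, not the shipped implementation:

#!/usr/bin/env python3
"""Illustrative LLM-judge evaluator: scores an artifact by shelling out to the user's CLI.
Assumes the CLI accepts the prompt as a command-line argument and prints a bare score."""

import subprocess
import sys

# --- CONFIGURE THESE (illustrative values) ---
CLI = ["claude", "-p"]  # assumed non-interactive invocation; swap in the codex/gemini equivalent
ARTIFACT = "draft.md"   # hypothetical file under evaluation
RUBRIC = "Rate the following copy from 0.0 to 1.0 for clarity and persuasiveness. Reply with only the number."
# --- END CONFIG ---

with open(ARTIFACT) as f:
    content = f.read()

result = subprocess.run(CLI + [RUBRIC + "\n\n" + content],
                        capture_output=True, text=True, timeout=120)
try:
    score = float(result.stdout.strip().split()[-1])
except (ValueError, IndexError):
    print("Could not parse judge output", file=sys.stderr)
    sys.exit(1)

print(f"score: {score:.4f}")  # metric line for the runner to read; format is illustrative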
Script improvements:
- setup_experiment.py: --domain, --scope, --evaluator, --list, --list-evaluators
- run_experiment.py: --experiment domain/name, --resume, --loop, --single
- log_results.py: --dashboard, --domain, --format csv|markdown|terminal, --output
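Illustrative invocations (the experiment name is hypothetical and positional arguments may differ):

  python setup_experiment.py --list-evaluators
  python run_experiment.py --experiment engineering/prompt-caching --loop
  python log_results.py --dashboard --domain engineering --format markdown --output results.md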
Results export:
- Terminal (default), CSV, and Markdown formats
- Per-experiment, per-domain, or cross-experiment dashboard view
SKILL.md rewritten:
- Clear activation triggers (when the skill should activate)
- Practical examples for each domain
- Evaluator documentation with cost transparency
- Simplified loop protocol matching Karpathy's original philosophy
engineering/autoresearch-agent/evaluators/test_pass_rate.py (new file, 55 lines)
@@ -0,0 +1,55 @@
#!/usr/bin/env python3
"""Measure test suite pass rate.
DO NOT MODIFY after experiment starts — this is the fixed evaluator."""

import re
import subprocess
import sys

# --- CONFIGURE THESE ---
TEST_CMD = "pytest tests/ --tb=no -q"  # Test command
# --- END CONFIG ---

result = subprocess.run(TEST_CMD, shell=True, capture_output=True, text=True, timeout=300)
output = result.stdout + "\n" + result.stderr

# Try to parse pytest output: "X passed, Y failed, Z errors"
passed = failed = errors = 0

# pytest short format: "5 passed, 2 failed in 1.23s"
match = re.search(r"(\d+) passed", output)
if match:
    passed = int(match.group(1))
match = re.search(r"(\d+) failed", output)
if match:
    failed = int(match.group(1))
match = re.search(r"(\d+) error", output)
if match:
    errors = int(match.group(1))

total = passed + failed + errors
if total == 0:
    # Try unittest format: "Ran X tests"
    match = re.search(r"Ran (\d+) test", output)
    if match:
        total = int(match.group(1))
        if result.returncode == 0:
            passed = total
        else:
            # Count failures from output
            fail_match = re.search(r"FAILED \(failures=(\d+)", output)
            if fail_match:
                failed = int(fail_match.group(1))
            passed = total - failed

if total == 0:
    print("Could not parse test results", file=sys.stderr)
    print(f"Output: {output[:500]}", file=sys.stderr)
    sys.exit(1)

rate = passed / total

print(f"pass_rate: {rate:.4f}")
print(f"passed: {passed}")
print(f"failed: {failed}")
print(f"total: {total}")
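For reference, an illustrative run in which 48 of 52 collected tests pass (and none error) would print:

pass_rate: 0.9231
passed: 48
failed: 4
total: 52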