refactor: autoresearch-agent v2.0 — multi-experiment, multi-domain, real-world evaluators

Major rewrite based on deep study of Karpathy's autoresearch repo.

Architecture changes:
- Multi-experiment support: .autoresearch/{domain}/{name}/ structure
- Domain categories: engineering, marketing, content, prompts, custom
- Project-level (git-tracked, shareable) or user-level (~/.autoresearch/) scope
- User chooses scope during setup, not installation

New evaluators (8 ready-to-use):
- Free: benchmark_speed, benchmark_size, test_pass_rate, build_speed, memory_usage
- LLM judge (uses existing subscription): llm_judge_content, llm_judge_prompt, llm_judge_copy
- LLM judges call user's CLI tool (claude/codex/gemini); no extra API keys needed

Script improvements:
- setup_experiment.py: --domain, --scope, --evaluator, --list, --list-evaluators
- run_experiment.py: --experiment domain/name, --resume, --loop, --single
- log_results.py: --dashboard, --domain, --format csv|markdown|terminal, --output

Results export:
- Terminal (default), CSV, and Markdown formats
- Per-experiment, per-domain, or cross-experiment dashboard view

SKILL.md rewritten:
- Clear activation triggers (when the skill should activate)
- Practical examples for each domain
- Evaluator documentation with cost transparency
- Simplified loop protocol matching Karpathy's original philosophy
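
For illustration, the .autoresearch/{domain}/{name}/ layout and the two scopes described above could be resolved roughly as follows. This is a minimal sketch, not the code in setup_experiment.py or run_experiment.py: the helper name experiment_dir, the scope values "project" and "user", and the validation rules are all assumptions. Only the directory pattern, the domain list, and the ~/.autoresearch/ location come from the message.

```python
from pathlib import Path
from typing import Optional

# Domain categories listed in the commit message.
DOMAINS = {"engineering", "marketing", "content", "prompts", "custom"}


def experiment_dir(spec: str, scope: str = "project",
                   project_root: Optional[Path] = None) -> Path:
    """Map a 'domain/name' spec to its .autoresearch/{domain}/{name}/ directory.

    Hypothetical helper: the name, the scope values, and the validation
    are assumptions made for this sketch.
    """
    domain, sep, name = spec.partition("/")
    if not sep or not name or "/" in name:
        raise ValueError(f"expected 'domain/name', got {spec!r}")
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    if scope == "project":
        # Project-level: lives inside the repository, so it can be git-tracked.
        base = (project_root or Path.cwd()) / ".autoresearch"
    elif scope == "user":
        # User-level: lives under the home directory.
        base = Path.home() / ".autoresearch"
    else:
        raise ValueError(f"unknown scope: {scope!r}")
    return base / domain / name
```

Under these assumptions, experiment_dir("engineering/api-speed", project_root=Path("/repo")) yields /repo/.autoresearch/engineering/api-speed, and passing scope="user" places the same experiment under ~/.autoresearch/ instead.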
2026-03-13 08:22:14 +01:00
parent c834d71a44
commit 12591282da
13 changed files with 1744 additions and 702 deletions
--- a/engineering/autoresearch-agent/evaluators/memory_usage.py
+++ b/engineering/autoresearch-agent/evaluators/memory_usage.py
@@ -0,0 +1,52 @@
+#!/usr/bin/env python3
+"""Measure peak memory usage of a command.
+DO NOT MODIFY after experiment starts — this is the fixed evaluator."""
+
+import os
+import platform
+import subprocess
+import sys
+
+# --- CONFIGURE THESE ---
+COMMAND = "python src/module.py"  # Command to measure
+# --- END CONFIG ---
+
+system = platform.system()
+
+if system == "Linux":
+    # GNU time -v prints peak RSS in kilobytes on stderr
+    result = subprocess.run(
+        f"/usr/bin/time -v {COMMAND}",
+        shell=True, capture_output=True, text=True, timeout=300
+    )
+    output = result.stderr
+    for line in output.splitlines():
+        if "Maximum resident set size" in line:
+            kb = int(line.split(":")[-1].strip())
+            mb = kb / 1024
+            print(f"peak_mb: {mb:.1f}")
+            print(f"peak_kb: {kb}")
+            sys.exit(0)
+    print("Could not parse memory from /usr/bin/time output", file=sys.stderr)
+    sys.exit(1)
+
+elif system == "Darwin":
+    # macOS: use /usr/bin/time -l
+    result = subprocess.run(
+        f"/usr/bin/time -l {COMMAND}",
+        shell=True, capture_output=True, text=True, timeout=300
+    )
+    output = result.stderr
+    for line in output.splitlines():
+        if "maximum resident set size" in line.lower():
+            # macOS reports in bytes
+            val = int(line.strip().split()[0])
+            mb = val / (1024 * 1024)
+            print(f"peak_mb: {mb:.1f}")
+            sys.exit(0)
+    print("Could not parse memory from /usr/bin/time -l output", file=sys.stderr)
+    sys.exit(1)
+
+else:
+    print(f"Unsupported platform: {system}. Use Linux or macOS.", file=sys.stderr)
+    sys.exit(1)
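
The two platform branches above differ only in how they read the "maximum resident set size" line: GNU time -v puts a kilobyte count after a colon, while macOS time -l puts a byte count in the first field. A minimal sketch of that parsing as a standalone function, which can be checked without running a command, might look like this. The function name parse_peak_mb and the sample strings are assumptions for illustration and are not part of the commit.

```python
def parse_peak_mb(time_stderr: str, system: str) -> float:
    """Extract peak RSS in MB from /usr/bin/time stderr output.

    Hypothetical helper mirroring the evaluator's two branches:
    Linux (GNU time -v) reports kilobytes after a colon,
    macOS (time -l) reports bytes as the first field on the line.
    """
    for line in time_stderr.splitlines():
        if "maximum resident set size" not in line.lower():
            continue
        if system == "Linux":
            return int(line.split(":")[-1].strip()) / 1024
        if system == "Darwin":
            return int(line.split()[0]) / (1024 * 1024)
        raise ValueError(f"unsupported platform: {system}")
    raise ValueError("no 'maximum resident set size' line found")


# Sample lines in the shape each tool emits (values are made up).
linux_sample = "\tMaximum resident set size (kbytes): 20480\n"
macos_sample = "            31457280  maximum resident set size\n"

print(f"peak_mb: {parse_peak_mb(linux_sample, 'Linux'):.1f}")   # peak_mb: 20.0
print(f"peak_mb: {parse_peak_mb(macos_sample, 'Darwin'):.1f}")  # peak_mb: 30.0
```

Keeping the unit conversion in one function makes the kilobyte-versus-byte difference explicit and lets it be verified against captured output from either platform.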