refactor: autoresearch-agent v2.0 - multi-experiment, multi-domain, real-world evaluators

Major rewrite based on a deep study of Karpathy's autoresearch repo.
Architecture changes:
- Multi-experiment support: .autoresearch/{domain}/{name}/ structure
- Domain categories: engineering, marketing, content, prompts, custom
- Project-level (git-tracked, shareable) or user-level (~/.autoresearch/) scope
- User chooses scope during setup, not installation
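The directory layout and the two scopes listed above could be resolved by a helper along these lines. This is a minimal sketch, not the code in setup_experiment.py: the function name `experiment_dir`, the scope labels `"project"` and `"user"`, and the assumption that project scope lives under the project root are all illustrative.

```python
from pathlib import Path
from typing import Optional

# Domain categories named in the commit message.
DOMAINS = {"engineering", "marketing", "content", "prompts", "custom"}


def experiment_dir(domain: str, name: str, scope: str = "project",
                   project_root: Optional[Path] = None) -> Path:
    """Return the .autoresearch/{domain}/{name}/ directory for an experiment.

    Hypothetical helper: the "project" and "user" scope labels are assumed.
    """
    if domain not in DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    if scope == "project":
        # Project-level: lives inside the repository, so it can be git-tracked.
        base = (project_root or Path.cwd()) / ".autoresearch"
    elif scope == "user":
        # User-level: lives in the home directory, outside any repository.
        base = Path.home() / ".autoresearch"
    else:
        raise ValueError(f"unknown scope: {scope!r}")
    return base / domain / name
```

Keeping the scope as an argument rather than a global setting matches the note that the user picks it during setup, per experiment, rather than at installation.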
New evaluators (8 ready-to-use):
- Free: benchmark_speed, benchmark_size, test_pass_rate, build_speed, memory_usage
- LLM judge (uses existing subscription): llm_judge_content, llm_judge_prompt, llm_judge_copy
- LLM judges call user's CLI tool (claude/codex/gemini) — no extra API keys needed
Script improvements:
- setup_experiment.py: --domain, --scope, --evaluator, --list, --list-evaluators
- run_experiment.py: --experiment domain/name, --resume, --loop, --single
- log_results.py: --dashboard, --domain, --format csv|markdown|terminal, --output
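As an illustration of the log_results.py flags listed above, an argparse parser might look like the following. Only the flag names, the three format choices, and the terminal default come from this commit message; the help texts and the exact meaning of each flag are assumptions, and `build_parser` is a hypothetical name.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the log_results.py flags named above (semantics assumed)."""
    parser = argparse.ArgumentParser(prog="log_results.py")
    parser.add_argument("--dashboard", action="store_true",
                        help="show a cross-experiment view (assumed meaning)")
    parser.add_argument("--domain",
                        help="restrict output to one domain (assumed meaning)")
    parser.add_argument("--format", choices=["csv", "markdown", "terminal"],
                        default="terminal", help="output format")
    parser.add_argument("--output",
                        help="write to this file instead of stdout (assumed meaning)")
    return parser
```

Calling `build_parser().parse_args(["--domain", "engineering", "--format", "csv"])` would yield a namespace with `format` set to `"csv"`; argparse rejects any format outside the three listed choices.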
Results export:
- Terminal (default), CSV, and Markdown formats
- Per-experiment, per-domain, or cross-experiment dashboard view
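The three export formats could be produced from one list of result rows, roughly as sketched below. The row shape (a list of dicts sharing the same keys) and the function name `render` are assumptions made for illustration; the actual log_results.py may store and format results differently.

```python
import csv
import io


def render(rows: list[dict], fmt: str = "terminal") -> str:
    """Render result rows (dicts sharing the same keys) as csv, markdown, or terminal text."""
    if not rows:
        return ""
    headers = list(rows[0])
    if fmt == "csv":
        buf = io.StringIO()
        writer = csv.DictWriter(buf, fieldnames=headers, lineterminator="\n")
        writer.writeheader()
        writer.writerows(rows)
        return buf.getvalue()
    if fmt == "markdown":
        lines = ["| " + " | ".join(headers) + " |",
                 "| " + " | ".join("---" for _ in headers) + " |"]
        lines += ["| " + " | ".join(str(r[h]) for h in headers) + " |" for r in rows]
        return "\n".join(lines) + "\n"
    if fmt == "terminal":
        # Pad every column to the width of its longest cell.
        widths = [max(len(h), *(len(str(r[h])) for r in rows)) for h in headers]
        out = ["  ".join(h.ljust(w) for h, w in zip(headers, widths))]
        out += ["  ".join(str(r[h]).ljust(w) for h, w in zip(headers, widths)) for r in rows]
        return "\n".join(out) + "\n"
    raise ValueError(f"unknown format: {fmt!r}")
```

Because all three branches read the same rows, a per-experiment, per-domain, or dashboard view only needs to change which rows are passed in, not how they are formatted.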
SKILL.md rewritten:
- Clear activation triggers (when the skill should activate)
- Practical examples for each domain
- Evaluator documentation with cost transparency
- Simplified loop protocol matching Karpathy's original philosophy
52
engineering/autoresearch-agent/evaluators/memory_usage.py
Normal file
@@ -0,0 +1,52 @@
#!/usr/bin/env python3
"""Measure peak memory usage of a command.
DO NOT MODIFY after experiment starts; this is the fixed evaluator."""

import os
import platform
import subprocess
import sys

# --- CONFIGURE THESE ---
COMMAND = "python src/module.py"  # Command to measure
# --- END CONFIG ---

system = platform.system()

if system == "Linux":
    # Use /usr/bin/time for peak RSS
    result = subprocess.run(
        f"/usr/bin/time -v {COMMAND}",
        shell=True, capture_output=True, text=True, timeout=300
    )
    output = result.stderr
    for line in output.splitlines():
        if "Maximum resident set size" in line:
            kb = int(line.split(":")[-1].strip())
            mb = kb / 1024
            print(f"peak_mb: {mb:.1f}")
            print(f"peak_kb: {kb}")
            sys.exit(0)
    print("Could not parse memory from /usr/bin/time output", file=sys.stderr)
    sys.exit(1)

elif system == "Darwin":
    # macOS: use /usr/bin/time -l
    result = subprocess.run(
        f"/usr/bin/time -l {COMMAND}",
        shell=True, capture_output=True, text=True, timeout=300
    )
    output = result.stderr
    for line in output.splitlines():
        if "maximum resident set size" in line.lower():
            # macOS reports in bytes
            val = int(line.strip().split()[0])
            mb = val / (1024 * 1024)
            print(f"peak_mb: {mb:.1f}")
            sys.exit(0)
    print("Could not parse memory from time output", file=sys.stderr)
    sys.exit(1)

else:
    print(f"Unsupported platform: {system}. Use Linux or macOS.", file=sys.stderr)
    sys.exit(1)
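The evaluator above reports its result as `name: value` lines on stdout (`peak_mb`, plus `peak_kb` on Linux). A caller could read those lines back with something like the sketch below. The function name `parse_metrics` and the regular expression are assumptions for illustration; this commit does not show how run_experiment.py actually consumes evaluator output.

```python
import re

# Matches lines such as "peak_mb: 12.3" or "peak_kb: 12596".
METRIC_LINE = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*:\s*(-?\d+(?:\.\d+)?)\s*$")


def parse_metrics(stdout: str) -> dict[str, float]:
    """Collect every `name: number` line from evaluator stdout into a dict."""
    metrics = {}
    for line in stdout.splitlines():
        match = METRIC_LINE.match(line)
        if match:
            # Lines that do not look like a metric are skipped, not treated as errors.
            metrics[match.group(1)] = float(match.group(2))
    return metrics
```

Since macOS runs print only `peak_mb`, a caller comparing results across platforms would likely want to key on `peak_mb` rather than `peak_kb`.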