---
title: "/hub:eval — Evaluate Agent Results — Agent Skill for Codex & OpenClaw"
description: "Evaluate and rank agent results by metric or LLM judge for an AgentHub session. Agent skill for Claude Code, Codex CLI, Gemini CLI, OpenClaw."
---

# /hub:eval — Evaluate Agent Results

<div class="page-meta" markdown>
<span class="meta-badge">:material-rocket-launch: Engineering - POWERFUL</span>
<span class="meta-badge">:material-identifier: `eval`</span>
<span class="meta-badge">:material-github: <a href="https://github.com/alirezarezvani/claude-skills/tree/main/engineering/agenthub/skills/eval/SKILL.md">Source</a></span>
</div>

<div class="install-banner" markdown>
<span class="install-label">Install:</span> <code>claude /plugin install engineering-advanced-skills</code>
</div>

Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.

## Usage

```
/hub:eval                     # Eval latest session using configured criteria
/hub:eval 20260317-143022     # Eval specific session
/hub:eval --judge             # Force LLM judge mode (ignore metric config)
```

## What It Does

### Metric Mode (eval command configured)

Run the evaluation command in each agent's worktree:

```bash
python {skill_path}/scripts/result_ranker.py \
  --session {session-id} \
  --eval-cmd "{eval_cmd}" \
  --metric {metric} --direction {direction}
```

Output:

```
RANK  AGENT    METRIC  DELTA  FILES
1     agent-2  142ms   -38ms  2
2     agent-1  165ms   -15ms  3
3     agent-3  190ms   +10ms  1

Winner: agent-2 (142ms)
```
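
Conceptually, metric mode boils down to running the eval command in each agent's worktree, parsing a number out of its output, and sorting by the configured direction. A minimal Python sketch of that loop (the worktree layout and number parsing here are illustrative assumptions, not the actual `result_ranker.py`):

```python
import re
import subprocess
from pathlib import Path

def rank_by_metric(session_id: str, eval_cmd: str, direction: str = "min"):
    """Hypothetical sketch, NOT the real result_ranker.py.

    Assumes worktrees live under .agenthub/worktrees/{session_id}/agent-N
    and that eval_cmd prints the metric (e.g. "142ms") on stdout.
    """
    results = []
    for worktree in sorted(Path(".agenthub/worktrees", session_id).iterdir()):
        out = subprocess.run(
            eval_cmd, shell=True, cwd=worktree,
            capture_output=True, text=True,
        ).stdout
        match = re.search(r"\d+(?:\.\d+)?", out)  # first number = metric value
        if match:
            results.append((worktree.name, float(match.group())))
    # direction="min": lower is better (latency); "max": higher is better
    results.sort(key=lambda r: r[1], reverse=(direction == "max"))
    return results
```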

### LLM Judge Mode (no eval command, or --judge flag)

For each agent:

1. Get the diff: `git diff {base_branch}...{agent_branch}`
2. Read the agent's result post from `.agenthub/board/results/agent-{i}-result.md`
3. Compare all diffs and rank by:
    - **Correctness** — Does it solve the task?
    - **Simplicity** — Fewer lines changed is better (when correctness is equal)
    - **Quality** — Clean execution, good structure, no regressions

Present rankings with justification.
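
The inputs for steps 1 and 2 can be gathered mechanically before the comparison. A sketch, assuming agents work on branches named `agent-{i}` (the branch naming here is an assumption based on the placeholders above):

```python
import subprocess
from pathlib import Path

def collect_judge_inputs(base_branch: str, n_agents: int):
    """Hypothetical helper: gather each agent's diff and result post
    so the judge can compare them side by side."""
    inputs = []
    for i in range(1, n_agents + 1):
        diff = subprocess.run(
            ["git", "diff", f"{base_branch}...agent-{i}"],
            capture_output=True, text=True,
        ).stdout
        post = Path(f".agenthub/board/results/agent-{i}-result.md").read_text()
        inputs.append({"agent": f"agent-{i}", "diff": diff, "result_post": post})
    return inputs
```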

Example LLM judge output for a content task:

```
RANK  AGENT    VERDICT                           WORD COUNT
1     agent-1  Strong narrative, clear CTA       1480
2     agent-3  Good data points, weak intro      1520
3     agent-2  Generic tone, no differentiation  1350

Winner: agent-1 (strongest narrative arc and call-to-action)
```

### Hybrid Mode

1. Run the metric evaluation first
2. If the top agents are within 10% of each other, use the LLM judge to break ties (see the sketch below)
3. Present both metric and qualitative rankings
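
The "within 10%" check in step 2 reads most naturally as relative to the winner's metric. A sketch reusing the sorted output from metric mode (the exact threshold semantics are an assumption):

```python
def needs_judge_tiebreak(ranked, threshold: float = 0.10) -> bool:
    """ranked: [(agent, metric_value), ...] sorted best-first.

    Sketch of the hybrid rule: escalate to the LLM judge when the
    runner-up is within `threshold` of the winner's metric value.
    """
    if len(ranked) < 2:
        return False
    best, runner_up = ranked[0][1], ranked[1][1]
    if best == 0:
        return runner_up == 0  # avoid division by zero on a zero metric
    return abs(runner_up - best) / abs(best) <= threshold
```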

## After Eval

1. Update session state:

    ```bash
    python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating
    ```

2. Tell the user:
    - Ranked results with winner highlighted
    - Next step: `/hub:merge` to merge the winner
    - Or `/hub:merge {session-id} --agent {winner}` to be explicit