Evaluate Agent Results by Metric is an ai-agents Claude skill built by Alireza Rezvani. Best for: AI development teams that want to automatically benchmark and rank competing agent solutions by performance metrics or qualitative assessment.
- What it does
- Evaluate and rank agent results using metrics, LLM judge comparison, or hybrid approach for AgentHub sessions.
- Category
- ai-agents
- Created by
- Alireza Rezvani
- Last updated
Evaluate Agent Results by Metric
Skill instructions
name: "eval"
description: "Evaluate and rank agent results by metric or LLM judge for an AgentHub session."
command: /hub:eval
/hub:eval — Evaluate Agent Results
Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.
Usage
/hub:eval # Eval latest session using configured criteria
/hub:eval 20260317-143022 # Eval specific session
/hub:eval --judge # Force LLM judge mode (ignore metric config)
What It Does
Metric Mode (eval command configured)
Run the evaluation command in each agent's worktree:
python {skill_path}/scripts/result_ranker.py \
--session {session-id} \
--eval-cmd "{eval_cmd}" \
--metric {metric} --direction {direction}
Output:
RANK  AGENT    METRIC  DELTA  FILES
1     agent-2  142ms   -38ms  2
2     agent-1  165ms   -15ms  3
3     agent-3  190ms   +10ms  1
Winner: agent-2 (142ms)
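The ranking in the table above can be sketched in a few lines of Python. This is a minimal illustration, not the actual `result_ranker.py`: the `AgentResult` type, the `rank_by_metric` function, and the `"lower"`/`"higher"` direction values are hypothetical, and the 180ms baseline is an assumption inferred from the DELTA column (each metric minus its delta gives 180).

```python
from dataclasses import dataclass


@dataclass
class AgentResult:
    agent: str
    metric: float       # measured value, e.g. latency in ms
    files_changed: int


def rank_by_metric(results, baseline, direction="lower"):
    """Return (rank, result, delta) tuples, best first.

    direction: "lower" if smaller values win, "higher" if larger values win
    (hypothetical names; the real script may use different ones).
    """
    if direction not in ("lower", "higher"):
        raise ValueError(f"unknown direction: {direction!r}")
    ordered = sorted(results, key=lambda r: r.metric,
                     reverse=(direction == "higher"))
    # Delta is measured against the baseline, so negative means "below baseline".
    return [(rank, r, r.metric - baseline)
            for rank, r in enumerate(ordered, start=1)]


results = [
    AgentResult("agent-1", 165, 3),
    AgentResult("agent-2", 142, 2),
    AgentResult("agent-3", 190, 1),
]
# Assumed baseline of 180ms, inferred from the example table.
ranking = rank_by_metric(results, baseline=180, direction="lower")

for rank, r, delta in ranking:
    print(f"{rank:<6}{r.agent:<9}{r.metric:g}ms   {delta:+g}ms  {r.files_changed}")
print(f"Winner: {ranking[0][1].agent} ({ranking[0][1].metric:g}ms)")
```

Sorting on the metric alone keeps the sketch simple; ties are left in input order because Python's sort is stable.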
LLM Judge Mode (no eval command, or --judge flag)
For each agent:
- Get the diff: git diff {base_branch}...{agent_branch}
- Read the agent's result post from .agenthub/board/results/agent-{i}-result.md
- Compare all diffs and rank by:
- Correctness — Does it solve the task?
- Simplicity — Fewer lines changed is better (when equal correctness)
- Quality — Clean execution, good structure, no regressions
Present rankings with justification.
Example LLM judge output for a content task:
RANK  AGENT    VERDICT                           WORD COUNT
1     agent-1  Strong narrative, clear CTA       1480
2     agent-3  Good data points, weak intro      1520
3     agent-2  Generic tone, no differentiation  1350
Winner: agent-1 (strongest narrative arc and call-to-action)
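The "fewer lines changed is better (when equal correctness)" rule can be illustrated with a small sort. This is only a sketch under stated assumptions: the correctness score is a hypothetical number the judge would assign, `count_changed_lines` and `rank_candidates` are illustrative names, and the diffs are made-up strings standing in for the output of `git diff {base_branch}...{agent_branch}`.

```python
def count_changed_lines(diff_text):
    """Count added and removed lines in a unified diff.

    Simplification: any line starting with '+++' or '---' is treated as a
    file header, so a changed line whose content begins with '--' or '++'
    would be missed.
    """
    changed = 0
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header, not a content change
        if line.startswith(("+", "-")):
            changed += 1
    return changed


def rank_candidates(candidates):
    """Sort (agent, correctness, diff_text) tuples, best first.

    Higher correctness wins; among equal correctness, the smaller diff wins.
    """
    return sorted(candidates,
                  key=lambda c: (-c[1], count_changed_lines(c[2])))


# Made-up diffs for illustration only.
small_diff = "--- a/app.py\n+++ b/app.py\n@@ -1,2 +1,2 @@\n-old\n+new\n context\n"
large_diff = ("--- a/app.py\n+++ b/app.py\n@@ -1,3 +1,3 @@\n"
              "-old1\n-old2\n+new1\n+new2\n context\n")

candidates = [
    ("agent-1", 1, large_diff),   # correct, 4 changed lines
    ("agent-2", 1, small_diff),   # correct, 2 changed lines
    ("agent-3", 0, small_diff),   # hypothetical: does not solve the task
]
ranked = rank_candidates(candidates)
for rank, (agent, correctness, diff) in enumerate(ranked, start=1):
    print(rank, agent, correctness, count_changed_lines(diff))
```

Quality is left out of the sort key on purpose: the source describes it qualitatively, so it stays with the judge's written justification.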
Hybrid Mode
- Run metric evaluation first
- If top agents are within 10% of each other, use LLM judge to break ties
- Present both metric and qualitative rankings
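One way to read the 10% rule is as a relative gap between the two best metric values. The sketch below assumes that reading; the function name `needs_judge`, the choice to compare only the top two agents, and measuring the gap relative to the best value are all assumptions, since the source does not define how "within 10%" is computed.

```python
def needs_judge(metrics, direction="lower", threshold=0.10):
    """Return True when the top two agents are within `threshold` of each other.

    metrics: mapping of agent name to metric value.
    The gap is measured relative to the best value (an assumption).
    """
    if len(metrics) < 2:
        return False  # nothing to break a tie between
    ordered = sorted(metrics.values(), reverse=(direction == "higher"))
    best, runner_up = ordered[0], ordered[1]
    if best == 0:
        # Avoid division by zero; only an exact tie counts as "close".
        return runner_up == 0
    return abs(runner_up - best) / abs(best) <= threshold


# Values from the metric-mode example: 165 vs 142 is about a 16% gap.
print(needs_judge({"agent-1": 165, "agent-2": 142, "agent-3": 190}))

# A closer race: 150 vs 142 is about a 5.6% gap.
print(needs_judge({"agent-1": 150, "agent-2": 142}))
```

With the example table's numbers the metric ranking stands on its own, and the LLM judge would only be consulted in the second, closer case.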
After Eval
- Update session state:
python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating
- Tell the user:
- Ranked results with winner highlighted
- Next step: /hub:merge to merge the winner
- Or /hub:merge {session-id} --agent {winner} to be explicit
Use this skill
Most skills are portable instruction packages. Claude Code supports SKILL.md directly. Other agents can use adapted files like AGENTS.md, .cursorrules, and GEMINI.md.
Claude Code
Save SKILL.md into your Claude Skills folder, then restart Claude Code.
mkdir -p ~/.claude/skills/evaluate-agent-results-by-metric && curl -L "https://raw.githubusercontent.com/alirezarezvani/claude-skills/HEAD/engineering/agenthub/skills/eval/SKILL.md" -o ~/.claude/skills/evaluate-agent-results-by-metric/SKILL.md
Installs to ~/.claude/skills/evaluate-agent-results-by-metric/SKILL.md.
Use cases
AI development teams use this to automatically benchmark and rank competing agent solutions by performance metrics or qualitative assessment.