Who created Evaluate Agent Results by Metric?

Evaluate Agent Results by Metric was created by Alireza Rezvani. Browse their full portfolio at https://notonproducthunt.com/creator/alirezarezvani.

Evaluate Agent Results by Metric is an ai-agents Claude skill built by Alireza Rezvani. Best for: AI development teams that want to automatically benchmark and rank competing agent solutions by performance metrics or qualitative assessment.

What it does: Evaluates and ranks agent results for AgentHub sessions using metrics, LLM judge comparison, or a hybrid approach.
Category: ai-agents
Created by: Alireza Rezvani
Last updated: March 27, 2026

Tags: Claude Skill, ai-agents, GitHub-backed, Curated, intermediate, Claude Code

Evaluate Agent Results by Metric


Skill instructions

name: "eval"
description: "Evaluate and rank agent results by metric or LLM judge for an AgentHub session."
command: /hub:eval

/hub:eval — Evaluate Agent Results

Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.

Usage

/hub:eval                           # Eval latest session using configured criteria
/hub:eval 20260317-143022           # Eval specific session
/hub:eval --judge                   # Force LLM judge mode (ignore metric config)

What It Does

Metric Mode (eval command configured)

Run the evaluation command in each agent's worktree:

python {skill_path}/scripts/result_ranker.py \
  --session {session-id} \
  --eval-cmd "{eval_cmd}" \
  --metric {metric} --direction {direction}

Output:

RANK  AGENT       METRIC      DELTA      FILES
1     agent-2     142ms       -38ms      2
2     agent-1     165ms       -15ms      3
3     agent-3     190ms       +10ms      1

Winner: agent-2 (142ms)
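
The ranking step behind the table above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual result_ranker.py: the `AgentResult` type, the `rank_results` function, and the `"lower"`/`"higher"` direction values are hypothetical names, and the baseline of 180 is only inferred from the sample rows (each METRIC minus its DELTA gives 180).

```python
from dataclasses import dataclass


@dataclass
class AgentResult:
    agent: str
    metric: float  # value reported by the eval command, e.g. latency in ms


def rank_results(results, direction="lower", baseline=None):
    """Sort results best-first and attach a delta against an optional baseline.

    `direction` is assumed to be "lower" (smaller is better) or "higher".
    """
    if direction not in ("lower", "higher"):
        raise ValueError(f"unknown direction: {direction}")
    ordered = sorted(results, key=lambda r: r.metric, reverse=(direction == "higher"))
    ranked = []
    for rank, r in enumerate(ordered, start=1):
        delta = None if baseline is None else r.metric - baseline
        ranked.append((rank, r.agent, r.metric, delta))
    return ranked


# Sample values copied from the output table above; baseline is an assumption.
results = [
    AgentResult("agent-1", 165),
    AgentResult("agent-2", 142),
    AgentResult("agent-3", 190),
]
ranking = rank_results(results, direction="lower", baseline=180)
for rank, agent, metric, delta in ranking:
    print(f"{rank}  {agent}  {metric:g}ms  {delta:+g}ms")
print(f"Winner: {ranking[0][1]} ({ranking[0][2]:g}ms)")
```

With `direction="higher"` the same function would put the largest metric first, which would suit metrics such as throughput or accuracy.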

LLM Judge Mode (no eval command, or --judge flag)

For each agent:

1. Get the diff: git diff {base_branch}...{agent_branch}
2. Read the agent's result post from .agenthub/board/results/agent-{i}-result.md
3. Compare all diffs and rank by:
   - Correctness: Does it solve the task?
   - Simplicity: Fewer lines changed is better (when correctness is equal)
   - Quality: Clean execution, good structure, no regressions

Present rankings with justification.

Example LLM judge output for a content task:

RANK  AGENT    VERDICT                               WORD COUNT
1     agent-1  Strong narrative, clear CTA            1480
2     agent-3  Good data points, weak intro           1520
3     agent-2  Generic tone, no differentiation       1350

Winner: agent-1 (strongest narrative arc and call-to-action)
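
The input-gathering part of LLM judge mode (one diff and one result post per agent) could look roughly like the sketch below. It only builds the command and path for each agent and does not invoke git or the judge. The helper names (`diff_command`, `result_post_path`, `judge_inputs`), the branch names, and the assumption that agents are numbered from 1 are all hypothetical; only the `git diff` form and the result path pattern come from the skill text.

```python
from pathlib import PurePosixPath


def diff_command(base_branch, agent_branch):
    """Build the three-dot diff command quoted in the skill text."""
    return ["git", "diff", f"{base_branch}...{agent_branch}"]


def result_post_path(agent_index):
    """Path of an agent's result post, following the pattern in the skill text."""
    return PurePosixPath(".agenthub/board/results") / f"agent-{agent_index}-result.md"


def judge_inputs(base_branch, agent_branches):
    """Pair each agent with the diff command and result post the judge would read."""
    return [
        {
            "agent": f"agent-{i}",
            "diff_cmd": diff_command(base_branch, branch),
            "result_post": str(result_post_path(i)),
        }
        for i, branch in enumerate(agent_branches, start=1)  # assumes 1-based agent ids
    ]


# Hypothetical branch names, for illustration only.
inputs = judge_inputs("main", ["example/agent-1", "example/agent-2"])
for item in inputs:
    print(item["agent"], " ".join(item["diff_cmd"]), item["result_post"])
```

To fetch a real diff, each `diff_cmd` could be passed to `subprocess.run(cmd, capture_output=True, text=True, check=True)` from inside the repository, and the resulting text handed to the judge together with the contents of the result post.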

Hybrid Mode

1. Run metric evaluation first
2. If top agents are within 10% of each other, use LLM judge to break ties
3. Present both metric and qualitative rankings
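
The 10% tie check can be sketched as below. The skill text does not say what the 10% is measured against, so this sketch assumes it means "within 10% of the best metric value"; the function names `contenders` and `needs_judge` are hypothetical.

```python
def contenders(metrics, direction="lower", tolerance=0.10):
    """Return agents whose metric is within `tolerance` of the best value, best-first.

    Assumption: "within 10%" is measured relative to the best metric.
    """
    if not metrics:
        return []
    pick = min if direction == "lower" else max
    best = pick(metrics.values())
    close = [a for a, v in metrics.items() if abs(v - best) <= tolerance * abs(best)]
    return sorted(close, key=lambda a: metrics[a], reverse=(direction == "higher"))


def needs_judge(metrics, direction="lower", tolerance=0.10):
    """True when more than one agent is close enough to the top to need a tie-break."""
    return len(contenders(metrics, direction, tolerance)) > 1


# Values from the metric-mode sample: agent-1 is about 16% slower than agent-2.
clear_win = {"agent-1": 165, "agent-2": 142, "agent-3": 190}
# Hypothetical closer race: agent-1 is about 6% slower than agent-2.
close_race = {"agent-1": 150, "agent-2": 142, "agent-3": 190}

print(contenders(clear_win), needs_judge(clear_win))
print(contenders(close_race), needs_judge(close_race))
```

Under this reading, the sample table from metric mode would not trigger the judge, while the closer race would send agent-2 and agent-1 to the LLM judge.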

After Eval

Update session state:

python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating

Tell the user:
- Ranked results with winner highlighted
- Next step: /hub:merge to merge the winner
- Or /hub:merge {session-id} --agent {winner} to be explicit


Use this skill

Most skills are portable instruction packages. Claude Code supports SKILL.md directly. Other agents can use adapted files like AGENTS.md, .cursorrules, and GEMINI.md.

Claude Code

Save SKILL.md into your Claude Skills folder, then restart Claude Code.

mkdir -p ~/.claude/skills/evaluate-agent-results-by-metric && curl -L "https://raw.githubusercontent.com/alirezarezvani/claude-skills/HEAD/engineering/agenthub/skills/eval/SKILL.md" -o ~/.claude/skills/evaluate-agent-results-by-metric/SKILL.md

Installs to ~/.claude/skills/evaluate-agent-results-by-metric/SKILL.md.

Use cases

AI development teams use this to automatically benchmark and rank competing agent solutions by performance metrics or qualitative assessment.


Stats

Installs: 0
GitHub Stars: 11.6k
Forks: 1507
License: MIT
Updated: Mar 27, 2026

Creator

Alireza Rezvani

@alirezarezvani
