Tags: ai-agents · intermediate

Agent Eval Benchmark Tool

Compare coding agents head-to-head on custom tasks with pass rate, cost, time, and consistency metrics.

Install

/plugin install agent-eval-benchmark-tool@affaan-m

Requires Claude Code CLI.

Use cases

Teams evaluating which coding agent (Claude Code, Aider, etc.) to adopt by running reproducible task benchmarks.
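To make the headline metrics concrete, here is a minimal sketch of how pass rate, average cost, average time, and consistency could be aggregated from repeated trial runs. This is an illustration only, not the tool's actual implementation: the `Run` record and `summarize` function are hypothetical names, and "consistency" is defined here as the share of tasks whose repeated trials all agree (all pass or all fail).

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Run:
    """One trial of one benchmark task (hypothetical schema)."""
    task: str
    passed: bool
    cost_usd: float
    seconds: float

def summarize(runs: list[Run]) -> dict[str, float]:
    """Aggregate per-run results into the four headline metrics."""
    by_task: dict[str, list[Run]] = {}
    for r in runs:
        by_task.setdefault(r.task, []).append(r)
    # Consistency: fraction of tasks where every trial has the same outcome.
    consistent = [
        all(r.passed for r in rs) or not any(r.passed for r in rs)
        for rs in by_task.values()
    ]
    return {
        "pass_rate": mean(1.0 if r.passed else 0.0 for r in runs),
        "avg_cost_usd": mean(r.cost_usd for r in runs),
        "avg_seconds": mean(r.seconds for r in runs),
        "consistency": mean(1.0 if c else 0.0 for c in consistent),
    }

runs = [
    Run("fix-bug", True, 0.12, 41.0),
    Run("fix-bug", True, 0.10, 38.5),
    Run("add-test", True, 0.08, 22.0),
    Run("add-test", False, 0.09, 25.0),
]
print(summarize(runs))
```

Running multiple trials per task is what makes the consistency metric meaningful: a single trial per task would always report 100% consistency.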

Reviews

No reviews yet.

Stats

Installs: 0
GitHub Stars: 112.0k
Forks: 14,594
License: MIT
Updated: Mar 27, 2026