Extract and Summarize Experiment Metrics is a research Claude Skill built by Diego Rodrigues de Sa e Souza. Best for: ML researchers who want to automatically consolidate training metrics and evaluation results across multiple fine-tuned runs into a single readable summary after experiments complete.
Parse fine-tuning loss and evaluation accuracy from completed experiments into a structured summary.md file.
Generate a summary.md file capturing key metrics from a completed experiment. Think R's summary() for experiment results.
Create a lightweight summary of experiment results:
The parse_eval_log.py script requires inspect-ai. Activate the conda environment from claude.local.md before running extraction commands.

Find the experiment directory:
Read experiment_summary.yaml to identify runs:
From the `runs:` section:

- `name`: Run identifier
- `type`: `"fine-tuned"` or `"control"`
- `model`: Model name
- `parameters`: Dict of hyperparameters (empty for control runs)

From the `evaluation.matrix:` section:
- `run`: Run name
- `tasks`: List of evaluation task names
- `epochs`: List of epochs to evaluate (null for control runs)

Determine status by checking the filesystem:
- `{output_base}/ck-out-{run_name}/` and SLURM outputs
- `{run_dir}/eval/logs/*.eval` files

For each COMPLETED fine-tuning run:
- `output_dir_base`: base path for training outputs
- Files: `{output_dir_base}/ck-out-{run_name}/slurm-*.out`
- Loss line regex: `(\d+)\|(\d+)\|Loss: ([0-9.]+)`, matching lines of the form `{epoch}|{step}|Loss: {value}`

Note: Training SLURM outputs are in the output directory, NOT the run directory.
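The loss-extraction step above can be sketched in Python. This is a minimal sketch: the directory layout and regex come from this document, while the function name `extract_loss` is illustrative.

```python
import glob
import re

# Matches the documented training log format: {epoch}|{step}|Loss: {value}
LOSS_RE = re.compile(r"(\d+)\|(\d+)\|Loss: ([0-9.]+)")

def extract_loss(output_dir_base: str, run_name: str):
    """Return (epoch, step, loss) tuples parsed from a run's SLURM stdout files."""
    records = []
    # Training SLURM outputs live under the output directory, not the run directory
    for path in sorted(glob.glob(f"{output_dir_base}/ck-out-{run_name}/slurm-*.out")):
        with open(path) as f:
            for line in f:
                m = LOSS_RE.search(line)
                if m:
                    records.append((int(m.group(1)), int(m.group(2)), float(m.group(3))))
    return records
```

The final loss for the summary table is then simply the last tuple's third field, and total steps the last tuple's second field.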
If SLURM stdout is missing:
For each COMPLETED evaluation:
- Find `{run_dir}/eval/logs/*.eval`
- Run `python tools/inspect/parse_eval_log.py {path}`
- For binary tasks, also run `summary_binary.py` to get balanced accuracy and F1

Script output format:
```json
{
  "status": "success",
  "task": "capitalization",
  "accuracy": 0.85,
  "samples": 100,
  "scorer": "exact_match",
  "model": "..."
}
```
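Collecting these per-file results could look like the following sketch. The script path and JSON payload are as documented above; the helper names `parse_output` and `collect_eval_results` are hypothetical.

```python
import glob
import json
import subprocess

def parse_output(stdout: str):
    """Interpret one parse_eval_log.py JSON payload; return None for error payloads."""
    parsed = json.loads(stdout)
    return parsed if parsed.get("status") == "success" else None

def collect_eval_results(run_dir: str):
    """Run parse_eval_log.py on each .eval log and keep successful results."""
    results = []
    for path in sorted(glob.glob(f"{run_dir}/eval/logs/*.eval")):
        proc = subprocess.run(
            ["python", "tools/inspect/parse_eval_log.py", path],
            capture_output=True, text=True,
        )
        parsed = parse_output(proc.stdout)
        if parsed is not None:
            results.append({"path": path, **parsed})
    return results
```

Keeping the `path` alongside each result makes it possible to join in the epoch information recovered from SLURM job names later.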
The .eval files don't currently store epoch information directly. To reliably map each evaluation to its epoch:
1. List `{run_dir}/eval/slurm-*.out` files (e.g. `slurm-2773062.out` → job ID 2773062)
2. Query SLURM for job names: `sacct -j {job_ids} --format=JobID,JobName%50`
3. Job names follow the pattern `eval-{task}-{run}-ep{N}`:
   - `eval-general_eval-lowlr-ep0` → epoch 0
   - `eval-general_eval-lowlr-ep9` → epoch 9
4. If needed, extract accuracy directly from the job's stdout: `grep -oP 'match/accuracy: \K[0-9.]+' slurm-{jobid}.out`
Example workflow:

```bash
# Get job names for all eval jobs
sacct -j 2773062,2773063,2773065 --format=JobID,JobName%50

# Output shows epoch in job name:
# 2773062  eval-general_eval-lowlr-ep0
# 2773063  eval-general_eval-lowlr-ep1
# 2773065  eval-general_eval-lowlr-ep2
```
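Parsing that `sacct` output into a job-to-epoch map can be sketched as below, assuming the two-column `JobID JobName` form shown above; `map_jobs_to_epochs` is an illustrative helper, not part of the skill.

```python
import re

# Documented job-name pattern: eval-{task}-{run}-ep{N}
JOB_NAME_RE = re.compile(r"eval-(?P<task>.+)-(?P<run>[^-]+)-ep(?P<epoch>\d+)$")

def map_jobs_to_epochs(sacct_output: str):
    """Parse `sacct --format=JobID,JobName%50` output into {job_id: (task, run, epoch)}."""
    mapping = {}
    for line in sacct_output.splitlines():
        parts = line.split()
        if len(parts) < 2:
            continue  # skip headers, separators, and blank lines
        job_id, job_name = parts[0], parts[1]
        m = JOB_NAME_RE.match(job_name)
        if m:
            mapping[job_id] = (m.group("task"), m.group("run"), int(m.group("epoch")))
    return mapping
```

Lines whose job name does not match the pattern (e.g. the `JobID JobName` header or `.batch` sub-steps) are skipped silently.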
This approach is reliable because:
If extraction fails:
```json
{"status": "error", "message": "..."}
```

For binary classification tasks (0/1 targets), use summary_binary.py to compute additional metrics:

```bash
python tools/inspect/summary_binary.py {path_to_eval_file} --json
```
JSON output format:

```json
{
  "status": "success",
  "path": "/path/to/file.eval",
  "samples": 100,
  "accuracy": 0.85,
  "balanced_accuracy": 0.85,
  "f1": 0.85,
  "precision_1": 0.86,
  "recall_1": 0.84,
  "recall_0": 0.86,
  "confusion_matrix": {"tp": 42, "tn": 43, "fp": 7, "fn": 8, "other": 0}
}
```
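All of the derived metrics follow directly from the confusion matrix counts. The sketch below is an illustrative reimplementation for reference, not the actual summary_binary.py code.

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int):
    """Derive binary-classification metrics from confusion matrix counts."""
    total = tp + tn + fp + fn
    recall_1 = tp / (tp + fn) if tp + fn else 0.0      # sensitivity
    recall_0 = tn / (tn + fp) if tn + fp else 0.0      # specificity
    precision_1 = tp / (tp + fp) if tp + fp else 0.0
    f1 = (2 * precision_1 * recall_1 / (precision_1 + recall_1)
          if precision_1 + recall_1 else 0.0)
    return {
        "accuracy": (tp + tn) / total,
        # Balanced accuracy averages per-class recall, so it is not
        # inflated by a dominant majority class.
        "balanced_accuracy": (recall_1 + recall_0) / 2,
        "f1": f1,
        "precision_1": precision_1,
        "recall_1": recall_1,
        "recall_0": recall_0,
    }
```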
Why these metrics matter for imbalanced data:
Note: For non-binary tasks, only accuracy is reported (Bal. Acc and F1 shown as "-").
Create `{experiment_dir}/summary.md` with the following structure:

```markdown
# Experiment Summary

**Experiment:** `{experiment_name}` | **Generated:** {timestamp} | **Status:** {X}/{Y} complete

## Run Status

| Run | Type | Fine-tuning | Evaluation |
|-----|------|-------------|------------|
| rank4_lr1e-5 | Fine-tuned | COMPLETED | COMPLETED |
| rank8_lr1e-5 | Fine-tuned | COMPLETED | COMPLETED |
| base_model | Control | N/A | COMPLETED |

## Training Results

| Run | Final Loss | Total Steps | Epochs | Duration |
|-----|------------|-------------|--------|----------|
| rank4_lr1e-5 | 0.234 | 250 | 2 | 8m 15s |
| rank8_lr1e-5 | 0.198 | 250 | 2 | 9m 02s |

**Notes:**
- Base model runs have no training loss (control)
- Duration from SLURM elapsed time (if available)

## Evaluation Results

| Run | Task | Epoch | Accuracy | Bal. Acc | F1 | Samples |
|-----|------|-------|----------|----------|------|---------|
| rank4_lr1e-5 | capitalization | 0 | 0.85 | 0.83 | 0.82 | 100 |
| rank4_lr1e-5 | capitalization | 1 | 0.88 | 0.86 | 0.85 | 100 |
| rank8_lr1e-5 | capitalization | 0 | 0.82 | 0.80 | 0.78 | 100 |
| rank8_lr1e-5 | capitalization | 1 | 0.91 | 0.89 | 0.88 | 100 |
| base_model | capitalization | - | 0.45 | 0.50 | 0.31 | 100 |

**Best performing:** rank8_lr1e-5 (epoch 1) with 89% balanced accuracy

## Incomplete Runs

| Run | Stage | Status | Notes |
|-----|-------|--------|-------|
| rank16_lr1e-5 | Fine-tuning | FAILED | Check slurm-12345.out |

## Next Steps

1. View detailed evaluation results: `inspect view --port=$(get_free_port)`
2. Export raw data: `inspect log export {run_dir}/eval/logs/*.eval --format csv`
3. Full analysis: `analyze-experiment` (when available)

---
*Generated by summarize-experiment skill*
```
Document the process in {experiment_dir}/logs/summarize-experiment.log.
See logging.md for action types and format.
Action types include `EXTRACT_LOSS`.

Running summarize-experiment multiple times overwrites `summary.md`. This is intentional:
```
{experiment_dir}/
├── summary.md                      # Human-readable summary (new)
└── logs/
    └── summarize-experiment.log    # Process log (new)
```
When analyze-experiment is built, summarize-experiment can either:
```
/plugin install extract-and-summarize-experiment-metrics@diegosouzapw
```

Requires Claude Code CLI.