Skill example · external (skill-creator)

skill-creator eval workspace

JSON-file workspace for evaluating and iterating on skills.

git clone github.com/anthropics/skills GitHub →

What it does

Each skill iteration produces eval results that you want to compare across versions — but a flat transcript can't answer "did iteration 8 actually get better than iteration 7?" The skill-creator eval workspace is a directory of structured JSON files: eval definitions live in evals/evals.json, each run produces its own subdirectory with grading.json + metrics.json + timing.json, and the workspace root carries history.json tracking version-over-version pass rates. With each artifact's shape stable, regression queries and pass-rate trends become diffs you can read instead of transcripts you have to re-trace.

How it does it

01

Define evals

Capture each test case in evals/evals.json — id, prompt, expected output, fixture files, and a list of expectations. Reusable across every skill version that gets tested.

evals/evals.json{ "skill_name": "pdf", "evals": [{ "id": 1, "prompt": "Extract names from form.pdf", "files": ["evals/files/form.pdf"], "expectations": [ "Output includes 'John Smith'", "Used the pdf-extract script" ] }] }
02

Run an iteration

Each skill version under test (v0, v1, v2) is an iteration. The executor agent runs every eval against this version and writes a per-run subdirectory.

workspace layoutworkspace/ ├── history.json └── runs/ └── v2/ └── eval-1/ ├── grading.json ├── outputs/metrics.json └── timing.json
03

Capture run metrics

The executor writes metrics.json (tool calls, files created, errors) and timing.json (durations, tokens) — the mechanical record of what happened.

outputs/metrics.json{ "tool_calls": {"Read": 5, "Bash": 8}, "total_steps": 6, "files_created": ["filled_form.pdf"], "errors_encountered": 0 }
04

Grade the run

The grader agent reads the run and writes grading.json — checks each expectation against the transcript, records evidence, computes a pass rate.

grading.json{ "expectations": [{ "text": "Output includes 'John Smith'", "passed": true, "evidence": "Found in Step 3" }], "summary": {"passed": 2, "failed": 1, "pass_rate": 0.67} }
05

Update history

After all evals run, the iteration's aggregate pass rate is appended to history.json at workspace root. That single file is the cross-iteration timeline.

history.json{ "skill_name": "pdf", "current_best": "v2", "iterations": [ {"version": "v0", "expectation_pass_rate": 0.65}, {"version": "v1", "expectation_pass_rate": 0.71}, {"version": "v2", "expectation_pass_rate": 0.84, "is_current_best": true} ] }
06

Pick the winning iteration

Read history.json to see which version is current_best; ship that skill version. The decision is a JSON read, not a vibe.

jqjq '.current_best' history.json # "v2"

Schema overview

FileWherePurpose
evals.json evals/ Reusable eval definitions: prompt, expected output, expectations
history.json workspace root Iteration timeline: version → pass rate, with current_best pointer
grading.json <run-dir>/ Per-expectation pass/fail with evidence; computed by the grader agent
metrics.json <run-dir>/outputs/ Mechanical run record: tool calls, steps, files created, errors
timing.json <run-dir>/ Wall-clock timing and token counts per run

Stable shapes are the value. Without them, "did this iteration get better?" is a transcript-reading task; with them, it's a jq query against history.json.

Go deeper