Skill example · external (skill-creator)

skill-creator eval workspace

JSON-file workspace for evaluating and iterating on skills.

git clone github.com/anthropics/skills

What it does

Each skill iteration produces eval results that you want to compare across versions, but a flat transcript can't answer "did iteration 8 actually get better than iteration 7?" The skill-creator eval workspace is a directory of structured JSON files: eval definitions live in evals/evals.json, each run produces its own subdirectory with grading.json, timing.json, and outputs/metrics.json, and the workspace root carries history.json, which tracks version-over-version pass rates. Because each artifact's shape is stable, regression queries and pass-rate trends become diffs you can read instead of transcripts you have to re-trace.

How it does it

Define evals

Capture each test case in evals/evals.json: id, prompt, expected output, fixture files, and a list of expectations. The definitions are reusable across every skill version that gets tested.

evals/evals.json
{
  "skill_name": "pdf",
  "evals": [{
    "id": 1,
    "prompt": "Extract names from form.pdf",
    "files": ["evals/files/form.pdf"],
    "expectations": [
      "Output includes 'John Smith'",
      "Used the pdf-extract script"
    ]
  }]
}
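Because the definitions are reused across versions, it can help to check their shape before a run. The following is a minimal sketch, not part of skill-creator itself: `validate_evals` and `REQUIRED_EVAL_FIELDS` are hypothetical names, and treating `id`, `prompt`, and `expectations` as required is an assumption drawn from the example above rather than a documented schema.

```python
import json

# Assumption: these fields are mandatory. The source only shows them in an
# example; it does not publish a formal schema.
REQUIRED_EVAL_FIELDS = ("id", "prompt", "expectations")


def validate_evals(doc: dict) -> list[str]:
    """Return a list of problems; an empty list means the shape looks OK."""
    problems = []
    if not isinstance(doc.get("skill_name"), str):
        problems.append("skill_name must be a string")
    seen_ids = set()
    for index, case in enumerate(doc.get("evals", [])):
        for field in REQUIRED_EVAL_FIELDS:
            if field not in case:
                problems.append(f"evals[{index}] is missing '{field}'")
        if "id" in case:
            if case["id"] in seen_ids:
                problems.append(f"evals[{index}] reuses id {case['id']}")
            seen_ids.add(case["id"])
    return problems


# Same content as the evals/evals.json example above.
example = json.loads("""
{
  "skill_name": "pdf",
  "evals": [{
    "id": 1,
    "prompt": "Extract names from form.pdf",
    "files": ["evals/files/form.pdf"],
    "expectations": [
      "Output includes 'John Smith'",
      "Used the pdf-extract script"
    ]
  }]
}
""")

print(validate_evals(example))  # prints []
```

Returning a list of problems instead of raising on the first one lets a single pass report every malformed eval.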

Run an iteration

Each skill version under test (v0, v1, v2) is an iteration. The executor agent runs every eval against this version and writes a per-run subdirectory.

workspace layout
workspace/
├── history.json
└── runs/
    └── v2/
        └── eval-1/
            ├── grading.json
            ├── outputs/metrics.json
            └── timing.json
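The layout above maps directly onto a small path helper. This is a sketch under the assumption that run directories are always named `runs/<version>/eval-<id>/`, as in the tree shown; `run_artifacts` is a hypothetical helper, not a skill-creator function.

```python
from pathlib import PurePosixPath


def run_artifacts(workspace: str, version: str, eval_id: int) -> dict:
    """Return the expected artifact paths for one eval run of one version."""
    # Assumption: directory naming follows the layout shown above.
    run_dir = PurePosixPath(workspace) / "runs" / version / f"eval-{eval_id}"
    return {
        "grading": run_dir / "grading.json",
        "metrics": run_dir / "outputs" / "metrics.json",
        "timing": run_dir / "timing.json",
    }


paths = run_artifacts("workspace", "v2", 1)
for name, path in paths.items():
    print(f"{name}: {path}")
```

Keeping the path logic in one place means the executor, the grader, and any later query agree on where each file lives.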

Capture run metrics

The executor writes metrics.json (tool calls, files created, errors) and timing.json (durations, tokens) — the mechanical record of what happened.

outputs/metrics.json
{
  "tool_calls": {"Read": 5, "Bash": 8},
  "total_steps": 6,
  "files_created": ["filled_form.pdf"],
  "errors_encountered": 0
}
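Since metrics.json is a mechanical record, it reduces to a few comparable numbers. A minimal sketch, assuming the field names in the example above; `summarize_metrics` and the keys it returns are hypothetical.

```python
import json


def summarize_metrics(metrics: dict) -> dict:
    """Reduce one metrics.json record to a few comparable numbers."""
    tool_calls = metrics.get("tool_calls", {})
    return {
        "total_tool_calls": sum(tool_calls.values()),
        # The tool with the highest call count, or None if no tools were used.
        "most_used_tool": max(tool_calls, key=tool_calls.get) if tool_calls else None,
        "had_errors": metrics.get("errors_encountered", 0) > 0,
    }


# Same content as the outputs/metrics.json example above.
metrics = json.loads("""
{
  "tool_calls": {"Read": 5, "Bash": 8},
  "total_steps": 6,
  "files_created": ["filled_form.pdf"],
  "errors_encountered": 0
}
""")

print(summarize_metrics(metrics))
# prints {'total_tool_calls': 13, 'most_used_tool': 'Bash', 'had_errors': False}
```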

Grade the run

The grader agent reads the run and writes grading.json: it checks each expectation against the transcript, records evidence, and computes a pass rate.

grading.json
{
  "expectations": [{
    "text": "Output includes 'John Smith'",
    "passed": true,
    "evidence": "Found in Step 3"
  }],
  "summary": {"passed": 2, "failed": 1, "pass_rate": 0.67}
}
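The `summary` object can be derived from the per-expectation results. The sketch below assumes the pass rate is passed divided by total, rounded to two decimals, which matches the 0.67 in the example but is not stated by the source. `summarize_grading` is a hypothetical name, and two of the three sample entries are placeholders added so the totals line up with the example summary.

```python
def summarize_grading(expectations: list[dict]) -> dict:
    """Build a `summary` object from per-expectation results."""
    passed = sum(1 for item in expectations if item["passed"])
    total = len(expectations)
    failed = total - passed
    # Assumption: two-decimal rounding, mirroring the 0.67 in the example.
    pass_rate = round(passed / total, 2) if total else 0.0
    return {"passed": passed, "failed": failed, "pass_rate": pass_rate}


# The first entry comes from the grading.json example above. The other two
# are hypothetical placeholders so the counts match its summary.
graded = [
    {"text": "Output includes 'John Smith'", "passed": True, "evidence": "Found in Step 3"},
    {"text": "hypothetical expectation A", "passed": True, "evidence": "placeholder"},
    {"text": "hypothetical expectation B", "passed": False, "evidence": "placeholder"},
]

print(summarize_grading(graded))
# prints {'passed': 2, 'failed': 1, 'pass_rate': 0.67}
```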

Update history

After all evals run, the iteration's aggregate pass rate is appended to history.json at the workspace root. That single file is the cross-iteration timeline.

history.json
{
  "skill_name": "pdf",
  "current_best": "v2",
  "iterations": [
    {"version": "v0", "expectation_pass_rate": 0.65},
    {"version": "v1", "expectation_pass_rate": 0.71},
    {"version": "v2", "expectation_pass_rate": 0.84, "is_current_best": true}
  ]
}
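The append step might look like the sketch below. It rests on one loud assumption: `current_best` is taken to be the iteration with the highest `expectation_pass_rate`, which fits the example but is not confirmed by the source. `record_iteration` is a hypothetical helper, and the v3 entry in the usage is invented for illustration.

```python
import copy


def record_iteration(history: dict, version: str, pass_rate: float) -> dict:
    """Return a copy of `history` with one more iteration recorded."""
    updated = copy.deepcopy(history)  # leave the caller's dict untouched
    iterations = updated.setdefault("iterations", [])
    iterations.append({"version": version, "expectation_pass_rate": pass_rate})

    # Assumption: "best" means highest pass rate. On a tie, max() keeps the
    # earliest entry, so an older version stays best until strictly beaten.
    best = max(iterations, key=lambda item: item["expectation_pass_rate"])
    for item in iterations:
        item.pop("is_current_best", None)
    best["is_current_best"] = True
    updated["current_best"] = best["version"]
    return updated


history_before = {
    "skill_name": "pdf",
    "current_best": "v1",
    "iterations": [
        {"version": "v0", "expectation_pass_rate": 0.65},
        {"version": "v1", "expectation_pass_rate": 0.71, "is_current_best": True},
    ],
}

history_after = record_iteration(history_before, "v2", 0.84)
print(history_after["current_best"])  # prints v2

# Hypothetical later iteration that scores lower: the pointer stays on v2.
history_later = record_iteration(history_after, "v3", 0.80)
print(history_later["current_best"])  # prints v2
```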

Pick the winning iteration

Read history.json to see which version is current_best; ship that skill version. The decision is a JSON read, not a vibe.

jq
jq '.current_best' history.json # "v2"

Schema overview

File	Where	Purpose
`evals.json`	`evals/`	Reusable eval definitions: prompt, expected output, expectations
`history.json`	workspace root	Iteration timeline: version → pass rate, with `current_best` pointer
`grading.json`	`<run-dir>/`	Per-expectation pass/fail with evidence; computed by the grader agent
`metrics.json`	`<run-dir>/outputs/`	Mechanical run record: tool calls, steps, files created, errors
`timing.json`	`<run-dir>/`	Wall-clock timing and token counts per run

Stable shapes are the value. Without them, "did this iteration get better?" is a transcript-reading task; with them, it's a jq query against history.json.
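The same question can be asked in a few lines of Python. This sketch computes the change in pass rate between consecutive iterations, assuming `iterations` is stored in chronological order as in the history.json example; `pass_rate_deltas` is a hypothetical helper.

```python
def pass_rate_deltas(history: dict) -> list[tuple[str, str, float]]:
    """Return (previous version, current version, change) for each step."""
    # Assumption: the list is already in chronological order.
    iterations = history["iterations"]
    deltas = []
    for previous, current in zip(iterations, iterations[1:]):
        change = current["expectation_pass_rate"] - previous["expectation_pass_rate"]
        deltas.append((previous["version"], current["version"], round(change, 2)))
    return deltas


# Same content as the history.json example above.
history = {
    "skill_name": "pdf",
    "current_best": "v2",
    "iterations": [
        {"version": "v0", "expectation_pass_rate": 0.65},
        {"version": "v1", "expectation_pass_rate": 0.71},
        {"version": "v2", "expectation_pass_rate": 0.84, "is_current_best": True},
    ],
}

for old, new, change in pass_rate_deltas(history):
    label = "regressed" if change < 0 else "improved or held"
    print(f"{old} -> {new}: {change:+.2f} ({label})")
```

A negative change flags a regression without opening a single transcript.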
