skill-creator eval workspace
JSON-file workspace for evaluating and iterating on skills.
What it does
Each skill iteration produces eval results that you want to compare across versions — but a flat transcript can't answer "did iteration 8 actually get better than iteration 7?" The skill-creator eval workspace is a directory of structured JSON files: eval definitions live in evals/evals.json, each run produces its own subdirectory with grading.json + metrics.json + timing.json, and the workspace root carries history.json tracking version-over-version pass rates. With each artifact's shape stable, regression queries and pass-rate trends become diffs you can read instead of transcripts you have to re-trace.
How it does it
Define evals
Capture each test case in evals/evals.json: id, prompt, expected output, fixture files, and a list of expectations. The definitions are reusable across every skill version that gets tested.
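As a rough illustration, one eval definition could be built and written like this. The text above names the fields but not the exact JSON keys, so the key names (`expected_output`, `files`, `expectations`), the top-level `evals` wrapper, and both helper functions are assumptions made for this sketch.

```python
import json
from pathlib import Path


def make_eval(eval_id, prompt, expected_output, files, expectations):
    """Build one eval definition as a plain dict (key names are assumed)."""
    return {
        "id": eval_id,
        "prompt": prompt,
        "expected_output": expected_output,
        "files": list(files),                # fixture files the eval needs
        "expectations": list(expectations),  # statements the grader will check
    }


def write_evals(workspace, evals):
    """Write the definitions to <workspace>/evals/evals.json and return the path."""
    path = Path(workspace) / "evals" / "evals.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # The {"evals": [...]} wrapper is an assumption, not a documented shape.
    path.write_text(json.dumps({"evals": evals}, indent=2))
    return path
```

Because the definitions carry no reference to a skill version, the same file can be replayed against every iteration.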
Run an iteration
Each skill version under test (v0, v1, v2) is an iteration. The executor agent runs every eval against this version and writes a per-run subdirectory.
Capture run metrics
The executor writes metrics.json (tool calls, files created, errors) and timing.json (durations, tokens) — the mechanical record of what happened.
Grade the run
The grader agent reads the run and writes grading.json: it checks each expectation against the transcript, records evidence for each verdict, and computes a pass rate.
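The grading step can be sketched as follows. This is a minimal illustration, not the grader agent's actual logic: the `check` callback stands in for whatever judgment the grader applies, and the `text`, `passed`, `evidence`, and `summary` keys are assumed names.

```python
def grade(expectations, check):
    """Grade each expectation and compute a pass rate.

    `check` is a hypothetical callback that takes an expectation string
    and returns a (passed, evidence) pair.
    """
    results = []
    for text in expectations:
        passed, evidence = check(text)
        results.append({"text": text, "passed": bool(passed), "evidence": evidence})

    total = len(results)
    n_passed = sum(r["passed"] for r in results)
    return {
        "expectations": results,
        "summary": {
            "passed": n_passed,
            "failed": total - n_passed,
            "total": total,
            # An empty run gets 0.0 rather than a division error.
            "pass_rate": n_passed / total if total else 0.0,
        },
    }
```

Keeping the evidence next to each verdict means a failed expectation can be audited from grading.json alone, without reopening the transcript.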
Update history
After all evals run, the iteration's aggregate pass rate is appended to history.json at workspace root. That single file is the cross-iteration timeline.
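A minimal sketch of the append step, assuming history.json holds an `iterations` list and a `current_best` version string. Both key names, the tie-breaking rule, and the idea that `current_best` is refreshed on every append are assumptions; the text above only says the file tracks pass rates and carries a current_best pointer.

```python
import json
from pathlib import Path


def record_iteration(history, version, pass_rate):
    """Return a new history dict with one iteration appended."""
    iterations = history.get("iterations", []) + [
        {"version": version, "pass_rate": pass_rate}
    ]
    # max() keeps the first maximum, so on a tie the earlier version stays best.
    best = max(iterations, key=lambda it: it["pass_rate"])
    return {"iterations": iterations, "current_best": best["version"]}


def update_history_file(workspace, version, pass_rate):
    """Read <workspace>/history.json if present, append, and write it back."""
    path = Path(workspace) / "history.json"
    history = json.loads(path.read_text()) if path.exists() else {}
    history = record_iteration(history, version, pass_rate)
    path.write_text(json.dumps(history, indent=2))
    return history
```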
Pick the winning iteration
Read history.json to see which version is current_best; ship that skill version. The decision is a JSON read, not a vibe.
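Reading the winner back might look like this, again assuming the hypothetical `iterations` and `current_best` keys; the real file may name them differently.

```python
import json
from pathlib import Path


def load_history(workspace):
    """Load <workspace>/history.json as a dict."""
    return json.loads((Path(workspace) / "history.json").read_text())


def winning_iteration(history):
    """Return the iteration entry that current_best points at, or None."""
    best = history.get("current_best")
    for entry in history.get("iterations", []):
        if entry["version"] == best:
            return entry
    return None
```

Returning None when the pointer matches no iteration makes a stale or hand-edited history.json visible instead of silently shipping the wrong version.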
Schema overview
| File | Where | Purpose |
|---|---|---|
| evals.json | evals/ | Reusable eval definitions: prompt, expected output, expectations |
| history.json | workspace root | Iteration timeline: version → pass rate, with current_best pointer |
| grading.json | <run-dir>/ | Per-expectation pass/fail with evidence; computed by the grader agent |
| metrics.json | <run-dir>/outputs/ | Mechanical run record: tool calls, steps, files created, errors |
| timing.json | <run-dir>/ | Wall-clock timing and token counts per run |
Stable shapes are the value. Without them, "did this iteration get better?" is a transcript-reading task; with them, it's a jq query against history.json.
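The same question can be asked from Python. The sketch below computes the change in pass rate between consecutive iterations, assuming history.json contains an `iterations` list of entries with `version` and `pass_rate` keys (assumed names, not a documented schema).

```python
def pass_rate_deltas(history):
    """Return (version, delta) pairs comparing each iteration to the one before it."""
    iterations = history.get("iterations", [])
    return [
        (cur["version"], cur["pass_rate"] - prev["pass_rate"])
        for prev, cur in zip(iterations, iterations[1:])
    ]


def regressions(history):
    """Return the versions whose pass rate dropped relative to the previous iteration."""
    return [version for version, delta in pass_rate_deltas(history) if delta < 0]
```

A negative delta flags an iteration that got worse than its predecessor, which is the comparison a flat transcript cannot give directly.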