A pelican learns to ride. 18 trajectories. 4 views.

Generate an SVG of a pelican riding a bicycle. We ran the same task eleven times with one model (iterating the prompt) and seven times with one prompt (iterating the model). Every trajectory is captured, scored, and replayable.

Eighteen trajectories. A peak of 40/40 on gemini-cli (generous Flash judge) and 37/40 on Sonnet’s harder rubric. Seven agent/model combos tried it; the fastest finished in just 2:57.

Three views.


7 agents · ranked

The Climb

All seven agent/model combinations on the same v1 runbook, ranked by best score. Scatter chart, table of best-of-10 rounds.


10 rounds per agent · arrow-key nav

Head-to-Head

Filmstrip viewer through every round of every agent's hill climb. Use the keyboard: ←→ step through frames, ↑↓ switch agents.


71 runbooks · 7 lineages

Runbook Diffs

Pick any two runbook versions across the 7 agent lineages and diff them line-by-line. Side-by-side or unified, copy-as-patch.


How it works under the hood

Each trajectory is a Jetty workflow: an agent runs a runbook against the task, self-critiques its output across three rounds, and writes a final SVG + report. Every step is captured for replay.
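The loop described above can be sketched roughly as follows. This is an illustration only, not Jetty's API: `run_trajectory`, `generate`, and `critique` are hypothetical names, and the sketch assumes "three rounds" means one initial draft followed by three critique-and-revise passes.

```python
from typing import Callable, Optional


def run_trajectory(
    task: str,
    generate: Callable[[str, Optional[str]], str],  # hypothetical: (task, notes) -> SVG text
    critique: Callable[[str], str],                 # hypothetical: SVG text -> critique notes
    rounds: int = 3,
) -> tuple[str, list[dict]]:
    """Draft, then critique and revise `rounds` times, recording every step."""
    steps: list[dict] = []
    svg = generate(task, None)
    steps.append({"step": "draft", "output": svg})
    for n in range(1, rounds + 1):
        notes = critique(svg)
        steps.append({"step": f"critique-{n}", "output": notes})
        svg = generate(task, notes)
        steps.append({"step": f"revise-{n}", "output": svg})
    # The recorded steps are what a replay view would walk through.
    return svg, steps
```

Keeping every intermediate output in `steps`, rather than only the final SVG, is what makes a trajectory replayable after the fact.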

The task

Generate a pure-XML SVG of a pelican riding a bicycle. No <image> tags. Under 50 KB. viewBox 800×600. Both subjects unmistakable; pelican interacts with the bicycle.
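The mechanical constraints in the task can be checked before any judge sees the output. The sketch below is one possible check, not the project's own validator: `check_svg` is a hypothetical helper, "50 KB" is assumed to mean 50 × 1024 bytes, and "viewBox 800×600" is assumed to mean a four-number viewBox whose width and height are 800 and 600.

```python
import re
import xml.etree.ElementTree as ET

MAX_BYTES = 50 * 1024  # assumption: "50 KB" means 50 * 1024 bytes


def local_name(tag: str) -> str:
    """Strip an XML namespace like '{http://www.w3.org/2000/svg}svg' down to 'svg'."""
    return tag.rsplit("}", 1)[-1]


def check_svg(svg_text: str) -> list[str]:
    """Return a list of constraint violations; an empty list means the SVG passes."""
    problems = []
    if len(svg_text.encode("utf-8")) >= MAX_BYTES:
        problems.append("file is not under 50 KB")
    try:
        root = ET.fromstring(svg_text)
    except ET.ParseError as exc:
        return problems + [f"not well-formed XML: {exc}"]
    if local_name(root.tag) != "svg":
        problems.append("root element is not <svg>")
    if any(isinstance(el.tag, str) and local_name(el.tag) == "image" for el in root.iter()):
        problems.append("contains an <image> element")
    # viewBox is "min-x min-y width height", separated by spaces or commas.
    parts = re.split(r"[\s,]+", root.get("viewBox", "").strip())
    try:
        size = [float(p) for p in parts][-2:]
    except ValueError:
        size = []
    if len(parts) != 4 or size != [800.0, 600.0]:
        problems.append("viewBox is not 800x600")
    return problems
```

The subjective requirements (both subjects unmistakable, pelican interacting with the bicycle) cannot be checked this way and are left to the rubric.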

The rubric

Four axes, 0–10 each: Pelican · Bicycle · Composition · Polish. Sonnet judges itself harshly (max ~37); Flash judges itself generously (40 freely awarded). Cross-rubric scores aren’t directly comparable.
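To make the arithmetic concrete: four axes at 0–10 each give a maximum of 40. A minimal sketch of how such scores might be represented follows; the `Score` class and `best_per_judge` helper are hypothetical, not the platform's data model. Scores are grouped by judge before any maximum is taken, since cross-rubric scores are not directly comparable.

```python
from dataclasses import dataclass

AXES = ("pelican", "bicycle", "composition", "polish")


@dataclass(frozen=True)
class Score:
    judge: str  # which rubric produced the score, e.g. "sonnet" or "flash"
    pelican: int
    bicycle: int
    composition: int
    polish: int

    def __post_init__(self):
        # Each axis is scored 0-10 inclusive.
        for axis in AXES:
            value = getattr(self, axis)
            if not 0 <= value <= 10:
                raise ValueError(f"{axis} must be in 0..10, got {value}")

    @property
    def total(self) -> int:
        """Sum of the four axes, out of 40."""
        return sum(getattr(self, axis) for axis in AXES)


def best_per_judge(scores: list[Score]) -> dict[str, Score]:
    """Pick the top score within each judge, never across judges."""
    best: dict[str, Score] = {}
    for s in scores:
        if s.judge not in best or s.total > best[s.judge].total:
            best[s.judge] = s
    return best
```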

The runbook

A markdown file describing the steps. The runbook can carry an embedded baseline SVG; iterating the runbook means editing this seed plus the targeted asks. Compare any two versions on the runbooks page.
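Because a runbook is plain markdown, two versions can also be compared offline with a standard unified diff. The sketch below uses Python's `difflib`; the function name and the default file names are placeholders, and this is not necessarily how the runbooks page produces its diffs.

```python
import difflib


def runbook_patch(
    old_text: str,
    new_text: str,
    old_name: str = "runbook_old.md",  # placeholder file names for the patch header
    new_name: str = "runbook_new.md",
) -> str:
    """Return a unified diff between two runbook versions as one patch string."""
    old_lines = old_text.splitlines(keepends=True)
    new_lines = new_text.splitlines(keepends=True)
    diff = difflib.unified_diff(old_lines, new_lines, fromfile=old_name, tofile=new_name)
    return "".join(diff)
```

Identical inputs yield an empty string. Both texts should end with a newline so the patch lines stay cleanly separated.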

An agentic evaluation platform for AI/ML workflows