Pelicans on a Bicycle on a Jetty
What is this?

An example of what Jetty can do.

This site is a worked example: we took Simon Willison’s pelican-on-bicycle benchmark and used Jetty to hill-climb a prompt against it. Eleven iterations of the runbook. Seven different model/agent combinations. Every trajectory captured, scored, and replayable.

What you’re looking at

Jetty is an agentic evaluation platform. You describe a task as a runbook — a markdown file with instructions, constraints, and a scoring rubric — and Jetty runs an agent against it, captures every step as a trajectory, and scores the output. To improve the result, you edit the runbook and run it again.
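The edit-run-score loop described above can be sketched as a small greedy hill climb. This is a minimal illustration, not Jetty's API: `hill_climb`, `toy_score`, and the three example edits are hypothetical names invented here, and the rule "keep an edit only if the score goes up" is an assumption about how you might drive the loop yourself.

```python
from typing import Callable, Iterable, List, Tuple


def hill_climb(
    baseline: str,
    edits: Iterable[Callable[[str], str]],
    run_and_score: Callable[[str], float],
) -> Tuple[str, float, List[float]]:
    """Greedy hill climb: apply each edit to the best runbook so far
    and keep it only if the score improves."""
    best_runbook = baseline
    best_score = run_and_score(baseline)
    history = [best_score]  # score of every run, kept or not
    for edit in edits:
        candidate = edit(best_runbook)
        score = run_and_score(candidate)
        history.append(score)
        if score > best_score:
            best_runbook, best_score = candidate, score
    return best_runbook, best_score, history


def toy_score(runbook: str) -> float:
    """Hypothetical stand-in for 'run the agent and score the output'.
    A real scorer would run a model; this one just checks for phrases."""
    score = 30.0
    if "verbatim" in runbook:
        score += 2  # seeding with a prior artifact helps
    if "y = 230" in runbook:
        score += 1  # a precise, measurable ask helps a little
    if "sun" in runbook:
        score -= 1  # extra scene elements cost a point
    return score


edits = [
    lambda rb: rb + "\nUse the previous SVG verbatim as round-1 input.",
    lambda rb: rb + "\nDrop the left-wing tips to y = 230.",
    lambda rb: rb + "\nAdd a sun.",
]
best, best_score, history = hill_climb(
    "Draw a pelican riding a bicycle as SVG.", edits, toy_score
)
print(best_score, history)
```

With the toy scorer, the first two edits are kept and the third is rejected, so the final runbook never includes the sun. The site itself shows every iteration in sequence, including the ones that scored lower.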

Across eighteen trajectories we asked the same question two ways: What moves the score more — iterating the runbook, or iterating the model? Spoiler: about the same. Six points by editing the runbook. Five points by swapping the model. Both worth doing.

The climb view plots the score over time. The filmstrip lets you scrub through every SVG output. Head-to-Head compares the seven agent/model runs side by side. Runbook Diffs shows you exactly what changed between any two prompt versions.

Why this prompt

The benchmark was popularised by Simon Willison, who has been running it against every major LLM release since 2024. He picked the prompt deliberately, and for several reasons:

They shouldn’t be able to draw anything at all. But they can generate code… and SVG is code.

— Simon Willison, Jun 2025 ↗

Most importantly: pelicans can’t ride bicycles. They’re the wrong shape!

— Simon Willison ↗

Everyone needs their own benchmark. So I’ve been increasingly leaning on my own, which started as a joke but is beginning to show itself to actually be a little bit useful.

— Simon Willison ↗

The prompt is honest about what LLMs are: text models that can’t actually draw. Generating an SVG of a pelican on a bicycle forces them to reason in code about shape, anatomy, and spatial composition all at once. The fact that pelicans can’t ride bicycles makes the task creative rather than rote — it can’t be memorised from training data.

And because SVG supports comments, the model often narrates what it’s trying to do as it draws. You don’t just see the output — you see the intent. That makes it a diagnostic, not just a leaderboard.
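To make the "you see the intent" point concrete, here is a minimal sketch that pulls the comments out of an SVG string. The SVG below is a hypothetical hand-written example, not output from any run on this site, and `svg_comments` is a helper name invented for illustration.

```python
import re
from typing import List

# Hypothetical model output, written by hand for illustration only.
svg = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 120">
  <!-- rear wheel -->
  <circle cx="50" cy="90" r="20" fill="none" stroke="black"/>
  <!-- front wheel -->
  <circle cx="150" cy="90" r="20" fill="none" stroke="black"/>
  <!-- pelican body, perched on the saddle -->
  <ellipse cx="95" cy="50" rx="25" ry="15" fill="white" stroke="black"/>
</svg>"""


def svg_comments(source: str) -> List[str]:
    """Return the text of every <!-- ... --> comment, in document order."""
    found = re.findall(r"<!--(.*?)-->", source, flags=re.DOTALL)
    return [text.strip() for text in found]


intent = svg_comments(svg)
for line in intent:
    print(line)
```

Reading the comments next to the rendered image shows where the stated intent and the actual geometry diverge. A regular expression is enough for a sketch like this; a full XML parser would be the safer choice for arbitrary input.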

The hill climb, in three steps

three picks from eleven iterations
BIG LEAP

Embed the baseline

V2 → V3 · 32 → 34/40 · +2
V2 · before
V3 · after · 34/40
The edit
“Use v2 SVG verbatim as round-1 input. Don't start from scratch.”
What happened
The round-1 score jumped from 28 to 33, five points just from skipping the cold start, while the final score rose from 32 to 34. The single highest-leverage edit in the whole sequence.
Lesson
When the runbook can carry a working artifact as a seed, it should.
PEAK

Coordinate-precise asks

V4 → V5 · 36 → 37/40 · +1
V4 · before
V5 · after · 37/40
The edit
“Close the 13-px right-wing-to-grip gap. Drop the left-wing tips to y ≈ 230.”
What happened
Composition hit 10/10 for the first time, and the pelican unambiguously grips the bars. This is the peak self-score on Sonnet's rubric, never beaten by any later iteration.
Lesson
At the top of the curve, the asks become measurements.
CAUTIONARY

Scope creep cost a point

V5 → V6 · 37 → 36/40 · -1
V5 · before
V6 · after · 36/40
The edit
“Add motion lines + wind tufts + a sun. Extend right-wing covert lines to the wrist.”
What happened
Visually richer scene. Composition slipped back to 9. The bird and bike are still recognisable, but the new scene elements added clutter the rubric couldn't reward.
Lesson
More is not better when the rubric isn't measuring scene density.
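The arithmetic behind the three picks can be checked in a few lines. The scores below are the final scores out of 40 quoted in the cards above. The V3 → V4 step is not one of the three picks; its delta is inferred here from V3's "after" score (34) and V4's "before" score (36).

```python
# Final self-scores (out of 40) as quoted in the three cards above.
scores = {"V2": 32, "V3": 34, "V4": 36, "V5": 37, "V6": 36}

versions = list(scores)  # dicts preserve insertion order
deltas = {
    f"{a}->{b}": scores[b] - scores[a]
    for a, b in zip(versions, versions[1:])
}
peak = max(scores, key=scores.get)

for step, delta in deltas.items():
    print(f"{step}: {delta:+d}")
print("peak:", peak, scores[peak])
```

The per-step deltas shrink as the score approaches its peak at V5 and then turn negative at V6, which is the shape the three cards describe.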

Take the rest seriously… or don’t

Self-scoring isn’t cross-comparable — Sonnet caps near 37/40 against its own rubric while Flash hands out 40/40 freely. The hill we’re climbing is one rubric at a time. To turn this into a true leaderboard you’d add an external vision judge as a separate Jetty workflow step. Worth doing if you care about the absolute number; not necessary to see whether a runbook edit moved your model in the right direction.

That’s the actual point: the runbook is the artefact you control. Every iteration here was a small, deliberate edit — not a model swap, not new infrastructure. The work happens in the prompt.

Your turn

Build your own runbook.

Pick a task you care about. Write a runbook. Run it on Jetty. Watch the trajectory, edit the prompt, run it again. Free trial — no credit card.

Start free on jetty.io
Or browse the code on GitHub