What is this?
An example of what Jetty can do.
This site is a worked example: we took Simon Willison’s pelican-on-bicycle benchmark and used Jetty to hill-climb a prompt against it. Eleven iterations of the runbook. Seven different model/agent combinations. Every trajectory captured, scored, and replayable.
What you’re looking at
Jetty is an agentic evaluation platform. You describe a task as a runbook — a markdown file with instructions, constraints, and a scoring rubric — and Jetty runs an agent against it, captures every step as a trajectory, and scores the output. To improve the result, you edit the runbook and run it again.
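As a rough sketch of that edit-and-rerun loop, here is what the cycle might look like in Python. This is not Jetty’s actual API: `Trajectory`, `hill_climb`, `run_agent` and `score_output` are hypothetical names invented for illustration, and the agent and scorer are passed in as plain functions so the loop stays self-contained.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    """One captured run: the runbook used, what the agent produced, and its score."""
    runbook: str
    output: str
    score: float


def hill_climb(
    base_runbook: str,
    edits: List[Callable[[str], str]],
    run_agent: Callable[[str], str],       # hypothetical: runbook text -> agent output
    score_output: Callable[[str], float],  # hypothetical: agent output -> rubric score
) -> List[Trajectory]:
    """Apply each edit to the best runbook so far and record every run."""
    output = run_agent(base_runbook)
    best = Trajectory(base_runbook, output, score_output(output))
    history = [best]

    for edit in edits:
        candidate = edit(best.runbook)
        output = run_agent(candidate)
        trajectory = Trajectory(candidate, output, score_output(output))
        history.append(trajectory)  # keep every run, including rejected edits
        if trajectory.score > best.score:
            best = trajectory       # only an improving edit becomes the new baseline

    return history
```

Every run is kept in `history`, including the edits that did not help, which is what makes a climb replayable afterwards.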
Across eighteen trajectories we asked the same question two ways: What moves the score more — iterating the runbook, or iterating the model? Spoiler: about the same. Six points by editing the runbook. Five points by swapping the model. Both worth doing.
The climb view plots the score over time. The filmstrip lets you scrub through every SVG output. Head-to-Head compares the seven agent/model runs side by side. Runbook Diffs shows you exactly what changed between any two prompt versions.
Why this prompt
The benchmark was popularised by Simon Willison, who has been running it against every major LLM release since 2024. He picked the prompt deliberately, and for several reasons:
They shouldn’t be able to draw anything at all. But they can generate code… and SVG is code.
— Simon Willison, Jun 2025 ↗
Most importantly: pelicans can’t ride bicycles. They’re the wrong shape!
— Simon Willison ↗
Everyone needs their own benchmark. So I’ve been increasingly leaning on my own, which started as a joke but is beginning to show itself to actually be a little bit useful.
— Simon Willison ↗
The prompt is honest about what LLMs are: text models that can’t actually draw. Generating an SVG of a pelican on a bicycle forces them to reason in code about shape, anatomy, and spatial composition all at once. Because pelicans can’t ride bicycles, the task is creative rather than rote: the model is unlikely to have a ready-made answer memorised from training data.
And because SVG supports comments, the model often narrates what it’s trying to do as it draws. You don’t just see the output — you see the intent. That makes it a diagnostic, not just a leaderboard.
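A minimal sketch of reading that narration back out, assuming the SVG is available as a string. The sample drawing and the `svg_comments` helper are made up for illustration. A regular expression is used because Python’s `xml.etree.ElementTree` parser discards comments by default.

```python
import re
from typing import List

# Matches XML comments, including ones that span several lines.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def svg_comments(svg_text: str) -> List[str]:
    """Return the text of every comment in the SVG, in document order."""
    return [body.strip() for body in COMMENT_RE.findall(svg_text)]


# Hand-written sample, not real model output.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 200 120">
  <!-- Rear wheel -->
  <circle cx="50" cy="90" r="25" fill="none" stroke="black"/>
  <!-- Pelican body, perched on the saddle -->
  <ellipse cx="100" cy="50" rx="30" ry="20" fill="white" stroke="black"/>
</svg>"""

if __name__ == "__main__":
    for comment in svg_comments(SAMPLE_SVG):
        print(comment)
```

Listing the comments next to the rendered image makes it easier to tell a wrong plan from a right plan drawn badly.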
Take the rest seriously… or don’t
Self-scored results aren’t comparable across models: Sonnet caps near 37/40 against its own rubric, while Flash hands out 40/40 freely. Each climb is measured against one rubric at a time, so a score only means something relative to other runs graded the same way. To turn this into a true leaderboard, you’d add an external vision judge as a separate Jetty workflow step. That is worth doing if you care about the absolute number, but it isn’t necessary to see whether a runbook edit moved your model in the right direction.
That’s the actual point: the runbook is the artefact you control. Every iteration here was a small, deliberate edit — not a model swap, not new infrastructure. The work happens in the prompt.