Pelicans on a Bicycle on a Jetty
visualization 1 of 3 · the climb

The climb.

Seven agents on the same v1 runbook. Same task in, very different birds out. Each row links to that agent's full 10-round hill climb.

[Figure: speed × best score, one dot per agent]

Seven agents, ranked.

| # | best score | pel. | bike | comp. | polish | time | run_id | agent · model | notes |
|---|---|---|---|---|---|---|---|---|---|
| 01 | 40/40 | 10 | 10 | 10 | 10 | 2:57 | 26621a9e | gemini-cli · gemini-3.5-flash | Fastest run. Flash's self-judge is generous; treat 40/40 as ceiling, not score. |
| 02 | 40/40 | 10 | 10 | 10 | 10 | 3:58 | a6439f91 | opencode · google/gemini-3.5-flash | Beach scene with helmet + basket of fish. Most illustrated of the seven. |
| 03 (FRESH) | 39.5/40 | 9.5 | 10 | 10 | 10 | 4:18 | 537afc12 | hermes · openrouter/google/gemini-3.5-flash | Same agent (hermes), Sonnet → Flash. +5 points from the model swap alone. |
| 04 | 37/40 | 9 | 9 | 10 | 9 | 6:53 | 7ddc3c19 | claude-code · claude-sonnet-4-6 | Reference baseline. Sonnet's own rubric caps at 37/40 across 11 hand-curated revisions. |
| 05 | 36.5/40 | 9.5 | 9.5 | 8.5 | 9 | 5:34 | 227d16dd | hermes · openrouter/z-ai/glm-5.1 | Zhipu's GLM 5.1 via OpenRouter. Mid-pack and consistent, with no judge inflation. |
| 06 | 36/40 | 9 | 9 | 9 | 9 | 5:02 | 6aa47ea4 | claude-code · claude-opus-4-7 | Same agent (claude-code), model swap Sonnet → Opus. Cleaner anatomy at higher cost. |
| 07 | 36/40 | 9 | 9 | 9 | 9 | 5:06 | f425696d | hermes · openrouter/anthropic/claude-sonnet-4.6 | Hermes routed to Sonnet via OpenRouter. Parts floated free of the bike on the initial run. |
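
The ranking above can be reproduced with a short sketch. It assumes that the best score is the plain sum of the four subscores (which matches every row) and that ties are broken by the faster time (which matches rows 01/02 and 06/07, but is inferred rather than stated). The `Run` class, the `rank` helper, and the expanded field names `pelican` and `composition` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    agent: str
    model: str
    pelican: float      # "pel." column (expansion assumed)
    bike: float
    composition: float  # "comp." column (expansion assumed)
    polish: float
    time: str           # run time as "m:ss"

    @property
    def total(self) -> float:
        """Best score out of 40: the sum of the four 0-10 subscores."""
        return self.pelican + self.bike + self.composition + self.polish

    @property
    def seconds(self) -> int:
        """Run time converted from "m:ss" to whole seconds."""
        minutes, secs = self.time.split(":")
        return int(minutes) * 60 + int(secs)


# Rows copied from the leaderboard, deliberately listed out of order.
RUNS = [
    Run("hermes", "openrouter/anthropic/claude-sonnet-4.6", 9, 9, 9, 9, "5:06"),
    Run("claude-code", "claude-sonnet-4-6", 9, 9, 10, 9, "6:53"),
    Run("opencode", "google/gemini-3.5-flash", 10, 10, 10, 10, "3:58"),
    Run("hermes", "openrouter/z-ai/glm-5.1", 9.5, 9.5, 8.5, 9, "5:34"),
    Run("gemini-cli", "gemini-3.5-flash", 10, 10, 10, 10, "2:57"),
    Run("claude-code", "claude-opus-4-7", 9, 9, 9, 9, "5:02"),
    Run("hermes", "openrouter/google/gemini-3.5-flash", 9.5, 10, 10, 10, "4:18"),
]


def rank(runs):
    """Sort by total score (high to low), then by time (fast to slow).

    The time tie-break is an assumption inferred from the listing.
    """
    return sorted(runs, key=lambda r: (-r.total, r.seconds))


if __name__ == "__main__":
    for position, run in enumerate(rank(RUNS), start=1):
        print(f"{position:02d}  {run.total:g}/40  {run.time}  {run.agent} · {run.model}")
```

Sorting on the tuple `(-total, seconds)` keeps the score as the primary key and uses speed only to separate equal scores, so a fast but low-scoring run never outranks a slower, better one.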
Powered by Jetty · jetty.io
An agentic evaluation platform for AI/ML workflows