Pelicans on a Bicycle on a Jetty
visualization 2 of 3 · head-to-head

Seven agents. Ten rounds each.

Each agent starts from the same v1 runbook and hill-climbs through ten auto-evolved rounds.

[Figure: Speed × best score, one dot per agent]

[Figure: v1 → v10 hill climb, one thumbnail per round]

Six findings.

what the comparison reveals
01

Gemini Flash dominates aesthetically

Across three different agents (gemini-cli, opencode, hermes), Flash produces the highest aesthetic scores. With the same v1 runbook, hermes scores 34/40 on Sonnet and 39/40 on Flash. The model, not the agent, drives the look.

02

Opus and Sonnet plateau at 36–37

Same agent (claude-code), same task. Scored on their own rubric, the Anthropic models cap near 36/40; even Sonnet's 11 hand-curated revisions never beat 37. Self-judges set their own ceilings.

03

Hermes silently cold-fails on bare model names

Unprefixed claude-sonnet-4-6 dies in 14s with no agent config saved. With openrouter/anthropic/claude-sonnet-4.6 (or any openrouter/<provider>/<model>), it runs the full sweep.
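
One way to avoid the silent cold-fail is to check the model ID before launching a sweep. The sketch below assumes the only requirement is the openrouter/<provider>/<model> shape described above; the helper names are hypothetical and are not part of hermes.

```python
import re

# Assumed shape of a routable model ID, taken from the finding above:
# openrouter/<provider>/<model>. Both helper names are hypothetical.
_PREFIXED_ID = re.compile(r"^openrouter/[^/\s]+/[^/\s]+$")


def is_prefixed_model_id(model_id: str) -> bool:
    """Return True if model_id looks like openrouter/<provider>/<model>."""
    return _PREFIXED_ID.fullmatch(model_id) is not None


def require_prefixed_model_id(model_id: str) -> str:
    """Raise before the sweep starts instead of cold-failing later."""
    if not is_prefixed_model_id(model_id):
        raise ValueError(
            f"bare model name {model_id!r}: expected openrouter/<provider>/<model>"
        )
    return model_id
```

Calling require_prefixed_model_id("claude-sonnet-4-6") raises a ValueError immediately, while "openrouter/anthropic/claude-sonnet-4.6" passes through unchanged.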

04

GLM 5.1 is the most honest climber

Zhipu's GLM 5.1 progressed cleanly: 24 → 30 → 35 → 36.5 across rounds. No judge inflation, no early-exit gaming — a steady auto-evolved curve.
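
As a quick check on the shape of that curve, the four quoted scores can be turned into per-step gains. This is a minimal sketch using only the numbers above; treating "clean" as "never regresses" is an assumption made for illustration.

```python
# Checkpoint scores quoted above for GLM 5.1 (out of 40).
glm_scores = [24, 30, 35, 36.5]

# Gain between consecutive checkpoints.
gains = [later - earlier for earlier, later in zip(glm_scores, glm_scores[1:])]

# Assumption: a "clean" climb means the score never drops between checkpoints.
never_regresses = all(gain >= 0 for gain in gains)

print(gains)            # [6, 5, 1.5]
print(never_regresses)  # True
```

The gains shrink at each step (6, then 5, then 1.5), which is the profile of a curve that is converging and not jumping.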

05

Flash is ~2× faster than Anthropic

gemini-cli 3:00 · opencode 4:00 · hermes+flash 4:18 · opus 5:00 · hermes+sonnet 5:06 · hermes+glm 5:34 · sonnet 7:00. Throughput matters when iterating on prompts.
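
The "~2×" figure can be checked against the quoted times. The sketch below parses the m:ss values and expresses each run as a multiple of the fastest one (gemini-cli); the variable and function names are illustrative only.

```python
# Wall-clock times quoted above, as "m:ss" strings.
timings = {
    "gemini-cli": "3:00",
    "opencode": "4:00",
    "hermes+flash": "4:18",
    "opus": "5:00",
    "hermes+sonnet": "5:06",
    "hermes+glm": "5:34",
    "sonnet": "7:00",
}


def to_seconds(mmss: str) -> int:
    """Convert an "m:ss" string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Express every run as a multiple of the gemini-cli time.
baseline = to_seconds(timings["gemini-cli"])
slowdown = {agent: to_seconds(t) / baseline for agent, t in timings.items()}

for agent, factor in sorted(slowdown.items(), key=lambda item: item[1]):
    print(f"{agent:14s} {factor:.2f}x")
```

On these numbers opus takes about 1.67× and sonnet about 2.33× the gemini-cli time, so "~2×" sits roughly in the middle of the two. Within hermes the gap is smaller: 5:06 with Sonnet against 4:18 with Flash.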

06

Self-scoring isn't cross-comparable

Flash gives itself 40/40 freely; Sonnet's rubric tops out around 36–37 even on great output. A real leaderboard needs an external vision judge that rates all outputs against the same rubric.

Open question

Self-scoring isn’t cross-comparable. To turn this into a real leaderboard, you’d need an external vision judge rating all seven outputs against the same rubric — worth its own task on Jetty.
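
A minimal sketch of what that external step could look like: every output goes through one shared judge, so the scores land on a common scale. The judge here is a placeholder callable, and the function name and the bytes-in, float-out interface are assumptions, not an existing Jetty API.

```python
from typing import Callable, Dict, List, Tuple


def rank_with_external_judge(
    outputs: Dict[str, bytes],
    judge: Callable[[bytes], float],
) -> List[Tuple[str, float]]:
    """Score every agent's output with the same judge and rank best-first.

    `outputs` maps an agent label to its rendered image bytes.
    `judge` is a hypothetical stand-in for a single vision model applying
    one rubric to every image.
    """
    scored = [(agent, judge(image)) for agent, image in outputs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because the same callable scores every image, differences in the ranking reflect the outputs and not each model's own leniency.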

Powered by Jetty · jetty.io
An agentic evaluation platform for AI/ML workflows