Head-to-Head · Pelican-on-Bicycle SVG

Each agent starts from the same v1 runbook and hill-climbs through auto-evolved rounds (ten for most, three for the Fusion sweeps).

Gemini Flash dominates aesthetically

Across three different agents (gemini-cli, opencode, hermes), Flash produces the highest aesthetic scores. With the same v1 runbook, hermes scores 34/40 with Sonnet and 39/40 with Flash. The model, not the agent, drives the look.

Opus and Sonnet plateau at 36-37

Same agent (claude-code), same task. When judging its own output, the Anthropic family caps near 36/40; even Sonnet's 11 hand-curated revisions never scored above 37. Self-judges set their own ceilings.

Hermes silently cold-fails on bare model names

An unprefixed model name such as claude-sonnet-4-6 dies in 14s with no agent config saved. With openrouter/anthropic/claude-sonnet-4.6 (or any openrouter/<provider>/<model>), hermes runs the full sweep.
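A pre-flight check along these lines could catch the bare-name case before a sweep is launched. This is a minimal sketch: `is_routed_model_id` is a hypothetical helper, not part of hermes, and it only tests for the three-part openrouter/<provider>/<model> shape described above.

```python
def is_routed_model_id(model_id: str) -> bool:
    """Return True if model_id has the shape openrouter/<provider>/<model>.

    Hypothetical helper for illustration; hermes itself may accept other forms.
    """
    parts = model_id.split("/")
    # Exactly three non-empty segments, the first being the router prefix.
    return len(parts) == 3 and parts[0] == "openrouter" and all(parts)


print(is_routed_model_id("claude-sonnet-4-6"))                       # False
print(is_routed_model_id("openrouter/anthropic/claude-sonnet-4.6"))  # True
```

Failing fast on the name is cheaper than waiting for a run to die and then noticing that no agent config was written.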

GLM 5.1 is the most honest climber

Zhipu's GLM 5.1 progressed cleanly: 24 → 30 → 35 → 36.5 across rounds. No judge inflation, no early-exit gaming — a steady auto-evolved curve.
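One way to make "steady" concrete is to look at the per-round gains. The sketch below uses only the four scores quoted above; the variable names are illustrative.

```python
# Scores quoted above for GLM 5.1, in round order.
scores = [24, 30, 35, 36.5]

# Gain from each round to the next.
gains = [later - earlier for earlier, later in zip(scores, scores[1:])]

# A clean climb here means every round improved on the previous one.
always_improving = all(gain > 0 for gain in gains)

print(gains)             # [6, 5, 1.5]
print(always_improving)  # True
```

The gains shrink each round (6, then 5, then 1.5) and never turn negative.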

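Flash is up to ~2× faster than Anthropic

gemini-cli 3:00 · opencode 4:00 · hermes+flash 4:18 · opus 5:00 · hermes+sonnet 5:06 · hermes+glm 5:34 · sonnet 7:00. The fastest Flash run (gemini-cli, 3:00) takes less than half the time of the slowest Anthropic run (sonnet, 7:00); against opus (5:00) the gap is about 1.7×. Run time matters when iterating on prompts.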
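The ratios can be checked directly from the timings listed above. This sketch assumes the values are minutes:seconds of wall-clock time per run and uses the fastest run (gemini-cli) as the baseline; the helper name `to_seconds` is illustrative.

```python
def to_seconds(mmss: str) -> int:
    """Convert an 'm:ss' string to whole seconds (assumed format)."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


# Timings as listed above.
timings = {
    "gemini-cli": "3:00",
    "opencode": "4:00",
    "hermes+flash": "4:18",
    "opus": "5:00",
    "hermes+sonnet": "5:06",
    "hermes+glm": "5:34",
    "sonnet": "7:00",
}

seconds = {name: to_seconds(t) for name, t in timings.items()}
baseline = seconds["gemini-cli"]

# How many times longer each run takes than the fastest one.
slowdown = {name: round(s / baseline, 2) for name, s in seconds.items()}

for name, ratio in sorted(slowdown.items(), key=lambda item: item[1]):
    print(f"{name:14s} {timings[name]:>5s}  {ratio:.2f}x")
```

Only the sonnet run exceeds 2× the baseline (2.33×); opus is at 1.67×, and hermes+sonnet is about 1.19× slower than hermes+flash (5:06 vs 4:18).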

Self-scoring isn't cross-comparable

Flash gives itself 40/40 freely; Sonnet's rubric tops out around 36–37 even on great output. A real leaderboard needs an external vision judge that rates all outputs against the same rubric.
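The shape of such a judge step might look like the sketch below. Everything here is an assumption for illustration: `rank_with_external_judge` and `placeholder_judge` are hypothetical names, and the placeholder is not an aesthetic measure at all. A real setup would render each SVG and send it to a single vision model with one fixed rubric.

```python
from typing import Callable, Dict, List, Tuple

RUBRIC_MAX = 40  # the 40-point scale used for the scores on this page


def rank_with_external_judge(
    outputs: Dict[str, str],
    judge: Callable[[str], float],
) -> List[Tuple[str, float]]:
    """Score every agent's SVG with the same judge and sort best-first."""
    scored = []
    for agent, svg in outputs.items():
        score = judge(svg)
        if not 0 <= score <= RUBRIC_MAX:
            raise ValueError(f"{agent}: score {score} outside 0..{RUBRIC_MAX}")
        scored.append((agent, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def placeholder_judge(svg: str) -> float:
    """Stand-in only: counts '<' characters so the example runs offline."""
    return float(min(RUBRIC_MAX, svg.count("<")))


# Toy inputs, not real agent outputs.
demo = {"agent-a": "<svg><circle/><path/></svg>", "agent-b": "<svg></svg>"}
print(rank_with_external_judge(demo, placeholder_judge))
# [('agent-a', 4.0), ('agent-b', 2.0)]
```

The point of the design is that the judge is passed in once and applied to every output, so no model grades its own work.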

Nine agents. One v1 runbook.

Speed × best score

v1 → v10 hill climb

Six findings.