Leaderboard
| # | Model | Harness | Combined score | S |
|---|---|---|---|---|
| 1 | fable-5 | claude code 2.1.152 | 0.274 | |
| 2 | Opus 4.8 | claude code 2.1.126 | 0.263 | |
| 3 | Sonnet 4.6 | claude code 2.1.152 | 0.248 | |
| 4 | Opus 4.7 | claude code 2.1.126 | 0.246 | |
| 5 | GPT-5.5 | codex 0.134 | 0.205 | |
| 6 | GPT-5.4-mini | codex 0.134 | 0.194 | |
| 7 | GPT-5.4 | codex 0.134 | 0.190 | |
| 8 | Gemini 3.5 Flash | antigravity · medium | 0.145 | |
| 9 | Haiku 4.5 | claude code 2.1.152 | 0.105 |
Agents are ranked by the Combined score S — DOM-grounded localization × behavior on human-annotated UI anchors, averaged over 10 apps (failures count as 0). Each model runs on its latest harness release (Codex-CLI 0.134 / Claude Code 2.1.152), with a free choice of stack. Condition C4 gives the agent the richest spec: the page's rendered Figma image (a screenshot mockup) and its pruned Figma structure (the layout tree as JSON) — but no target framework.
S ∈ [0, 1]; higher is better. Scored with the corrected README page-mapping. Opus 4.7 and GPT-5.4 entries come from externally-provided result sets that could not be re-verified against the fixed parser. Gemini 3.5 Flash (High) and Gemini 3.1 Pro (High) are omitted pending complete runs (only 7/10 and 5/10 tasks available).
Co-evolution
An agent is a model inside a harness, and both ship on their own cadence. Tracking the same model across harness releases shows the two co-evolving — sometimes lifting each other, sometimes regressing. Combined score by npm release date.
Mean C4 combined score (n=10, failures scored 0). Codex-CLI and Claude Code use independent version schemes, shown as separate panels.
How it's scored
Each task asks an agent to build and launch a multi-page web app from a visual spec. We bring the app up in Docker and, for every human-annotated UI anchor, match it to a rendered DOM element — scoring localization L (IoU / distance to the mockup position) and behavior B (interaction-specific browser checks). The headline Combined score is S = mean(L · B) over the critical anchors.


