VISTA Leaderboard — Visual Spec-to-Web-App Coding Agents

Leaderboard

Combined score — condition C4 · latest harness

#	Model	Harness	S
1	fable-5	claude code 2.1.152	0.274
2	Opus 4.8	claude code 2.1.126	0.263
3	Sonnet 4.6	claude code 2.1.152	0.248
4	Opus 4.7	claude code 2.1.126	0.246
5	GPT-5.5	codex 0.134	0.205
6	GPT-5.4-mini	codex 0.134	0.194
7	GPT-5.4	codex 0.134	0.190
8	Gemini 3.5 Flash	antigravity · medium	0.145
9	Haiku 4.5	claude code 2.1.152	0.105

Agents are ranked by the Combined score S: DOM-grounded localization × behavior on human-annotated UI anchors, averaged over 10 apps (failures count as 0). Each model runs on its latest harness release (Codex-CLI 0.134 / Claude Code 2.1.152; the Opus 4.7 and Opus 4.8 rows use Claude Code 2.1.126, as listed in the table), with a free choice of stack. Condition C4 gives the agent the richest spec, the page's rendered Figma image (a screenshot mockup) and its pruned Figma structure (the layout tree as JSON), but no target framework.
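As a rough illustration of the "averaged over 10 apps (failures count as 0)" rule, the sketch below averages per-app scores and treats a failed app as a zero. The function name `leaderboard_score` and the use of `None` to mark a failure are assumptions made for this example, not the benchmark's actual code, and the numbers in the usage lines are made up.

```python
from typing import Optional, Sequence


def leaderboard_score(per_app_scores: Sequence[Optional[float]]) -> float:
    """Average per-app Combined scores; a failed app (None) counts as 0.

    Hypothetical helper: the real harness may represent failures differently.
    """
    if not per_app_scores:
        raise ValueError("need at least one app")
    # Failures stay in the denominator, so they pull the mean down.
    total = sum(0.0 if s is None else s for s in per_app_scores)
    return total / len(per_app_scores)


# Illustrative values only (not benchmark results): one of four apps failed.
example = leaderboard_score([0.5, None, 0.25, 0.25])
print(example)
```

Keeping failures in the denominator means an agent cannot raise its score by failing to launch the harder apps.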

S ∈ [0, 1]; higher is better. Scored with the corrected README page-mapping. The Opus 4.7 and GPT-5.4 entries come from externally provided result sets that could not be re-verified against the fixed parser. Gemini 3.5 Flash (High) and Gemini 3.1 Pro (High) are omitted pending complete runs (only 7/10 and 5/10 tasks available, respectively).

Co-evolution

Harness × LLM — co-evolving over releases

An agent is a model inside a harness, and both ship on their own cadence. Tracking the same model across harness releases shows the two co-evolving, sometimes lifting each other, sometimes regressing. The Combined score is plotted by npm release date.

Codex-CLI: all three GPT models dip from the Apr-30 build (release 0.128) to the May-26 build (release 0.134); GPT-5.4 peaks at release 0.128.

Claude Code — Sonnet 4.6 rises monotonically across releases; fable-5 (2.1.152) tops the field; Haiku 4.5 trails. Opus 4.7/4.8 shown at 2.1.126.

Mean C4 combined score (n=10, failures scored 0). Codex-CLI and Claude Code use independent version schemes, shown as separate panels.

How it's scored

DOM-grounded, behavior-aware

Each task asks an agent to build and launch a multi-page web app from a visual spec. We bring the app up in Docker and, for every human-annotated UI anchor, match it to a rendered DOM element — scoring localization L (IoU / distance to the mockup position) and behavior B (interaction-specific browser checks). The headline Combined score is S = mean(L · B) over the critical anchors.
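The formula S = mean(L · B) can be sketched as follows. This is a minimal sketch under stated assumptions: localization is approximated by plain bounding-box IoU (the source mentions "IoU / distance" without giving the exact blend), behavior is assumed to be a value in [0, 1], and an empty anchor list is assumed to score 0. The names `iou`, `AnchorResult`, and `combined_score` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in mockup pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


@dataclass
class AnchorResult:
    localization: float  # L, assumed to lie in [0, 1]
    behavior: float      # B, assumed to lie in [0, 1]


def combined_score(anchors: Sequence[AnchorResult]) -> float:
    """S = mean(L * B) over the given (critical) anchors."""
    if not anchors:
        return 0.0  # assumption: no anchors scores 0
    return sum(a.localization * a.behavior for a in anchors) / len(anchors)


# Illustrative values only: one anchor placed and working, one misbehaving.
anchors = [AnchorResult(1.0, 1.0), AnchorResult(0.5, 0.0)]
print(combined_score(anchors))
```

Because L and B are multiplied per anchor, an element that is well placed but fails its interaction check contributes nothing, and the same holds for a working element that cannot be matched to its mockup position.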

Pipeline: a visual spec drives the agent (model × harness) to build a runnable app; each annotated UI anchor is matched to a DOM element and scored on localization and behavior, combined into S.