
VISTA Leaderboard

Visual Spec-To-App Benchmark — how well do coding agents build real web apps from a design?

10 app categories · 128 annotated pages · 458 visual anchor points · built from Figma designs

Leaderboard

Combined score — condition C4 · latest harness
#   Model             Harness                 Combined score S
1   fable-5           claude code 2.1.152     0.274
2   Opus 4.8          claude code 2.1.126     0.263
3   Sonnet 4.6        claude code 2.1.152     0.248
4   Opus 4.7          claude code 2.1.126     0.246
5   GPT-5.5           codex 0.134             0.205
6   GPT-5.4-mini      codex 0.134             0.194
7   GPT-5.4           codex 0.134             0.190
8   Gemini 3.5 Flash  antigravity · medium    0.145
9   Haiku 4.5         claude code 2.1.152     0.105

Agents are ranked by the Combined score S: DOM-grounded localization × behavior on human-annotated UI anchors, averaged over 10 apps (failures count as 0). Each model runs on its latest evaluated harness release (Codex-CLI 0.134 or Claude Code 2.1.152; Opus 4.7 and Opus 4.8 are shown at Claude Code 2.1.126, and Gemini 3.5 Flash runs on antigravity), with a free choice of stack. Condition C4 gives the agent the richest spec: the page's rendered Figma image (a screenshot mockup) and its pruned Figma structure (the layout tree as JSON), but no target framework.

S ∈ [0, 1]; higher is better. Scored with the corrected README page-mapping. Opus 4.7 and GPT-5.4 entries come from externally-provided result sets that could not be re-verified against the fixed parser. Gemini 3.5 Flash (High) and Gemini 3.1 Pro (High) are omitted pending complete runs (only 7/10 and 5/10 tasks available).
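The "averaged over 10 apps, failures count as 0" rule can be sketched as below. This is a minimal illustration, not the benchmark's actual code: the function name `leaderboard_score` and the use of `None` to mark a failed app are assumptions made for the example.

```python
from typing import Optional, Sequence


def leaderboard_score(per_app_scores: Sequence[Optional[float]]) -> float:
    """Average per-app Combined scores, treating failures as 0.

    Assumption for this sketch: a failed app (for example, one that never
    builds or launches) is recorded as None. It stays in the denominator
    and contributes 0 to the sum.
    """
    if not per_app_scores:
        raise ValueError("expected at least one app score")
    total = sum(0.0 if s is None else s for s in per_app_scores)
    return total / len(per_app_scores)
```

Under this rule a failed app pulls the mean down instead of being dropped: scores of 0.5, 0.25, 0.25 plus one failure average to 0.25, not to the 0.333 that the three successful apps alone would give.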

Co-evolution

Harness × LLM — co-evolving over releases

An agent is a model inside a harness, and both ship on their own cadence. Tracking the same model across harness releases shows the two co-evolving, sometimes lifting each other and sometimes regressing. The charts plot Combined score against npm release date.

[Chart: Combined score (y-axis, 0.00 to 0.30) by Codex-CLI version and release date (x-axis: 0.116, Mar 19; 0.128, Apr 30; 0.134, May 26). Series: GPT-5.4, GPT-5.5, GPT-5.4-mini.]
Codex-CLI — all three GPT models dip from the Apr-30 (0.128) to May-26 (0.134) build; GPT-5.4 peaks at 0.128.
[Chart: Combined score (y-axis, 0.00 to 0.30) by Claude Code version and release date (x-axis: 2.1.58, Feb 25; 2.1.126, Apr 30; 2.1.152, May 26). Series: Sonnet 4.6, Haiku 4.5, Opus 4.7, Opus 4.8, fable-5.]
Claude Code — Sonnet 4.6 rises monotonically across releases; fable-5 (2.1.152) tops the field; Haiku 4.5 trails. Opus 4.7/4.8 shown at 2.1.126.

Mean C4 combined score (n=10, failures scored 0). Codex-CLI and Claude Code use independent version schemes, shown as separate panels.

How it's scored

DOM-grounded, behavior-aware

Each task asks an agent to build and launch a multi-page web app from a visual spec. We bring the app up in Docker and, for every human-annotated UI anchor, match it to a rendered DOM element — scoring localization L (IoU / distance to the mockup position) and behavior B (interaction-specific browser checks). The headline Combined score is S = mean(L · B) over the critical anchors.

Visual spec (mockup · Figma · anchors) → Agent (model × harness) → Live app (docker compose up) → Anchor → DOM (localize L × behavior B) → Combined S (mean(L · B))
Pipeline: a visual spec drives the agent (model × harness) to build a runnable app; each annotated UI anchor is matched to a DOM element and scored on localization and behavior, combined into S.
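The per-app formula S = mean(L · B) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the benchmark's implementation: it uses plain bounding-box IoU as the localization term (the page mentions both IoU and distance without saying how they combine), treats B as a value in [0, 1], and assumes the input list already holds only the critical anchors. The names `Box`, `AnchorResult`, `iou`, and `combined_score` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple

# Assumed box layout for this sketch: (x, y, width, height) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    overlap_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    overlap_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = overlap_w * overlap_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


@dataclass
class AnchorResult:
    localization: float  # L in [0, 1], e.g. IoU of mockup box vs. DOM box
    behavior: float      # B in [0, 1], outcome of the browser checks


def combined_score(anchors: Sequence[AnchorResult]) -> float:
    """S = mean(L * B) over the given (critical) anchors."""
    if not anchors:
        return 0.0  # assumption: an app with no scorable anchors gets 0
    return sum(a.localization * a.behavior for a in anchors) / len(anchors)
```

Because the two terms are multiplied, an anchor that is placed perfectly but fails its behavior check contributes 0, and so does a working element rendered far from its mockup position. For example, one anchor with L = 1, B = 1 and one with L = 0.5, B = 0 give S = 0.5.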