Android Bench

AI-assisted software engineering has seen the emergence of several benchmarks to measure the capabilities of LLMs. Android developers face specific challenges that aren't covered by existing benchmarks, so we created one that focuses on a north star of high quality Android development.

Android LLM Leaderboard

Model	Score (%)	arrow_range Cl range (%)	Date
GPT-5.4	72.4	65.1 — 79.3	2026-03-16
Gemini 3.1 Pro Preview	72.4	64.8 — 79.3	2026-02-27
GPT-5.3-Codex	67.7	60.1 — 74.8	2026-03-18
Claude Opus 4.6	66.6	58.5 — 74.0	2026-02-26
GPT-5.2-Codex	62.5	54.8 — 69.8	2026-02-26
Claude Opus 4.5	61.9	53.8 — 70.3	2026-02-26
Gemini 3 Pro Preview	60.4	52.4 — 68.1	2026-02-27
Claude Sonnet 4.6	58.4	50.9 — 66.5	2026-02-27
Claude Sonnet 4.5	54.2	46.0 — 62.1	2026-02-26
Gemini 3 Flash Preview	42.0	36.4 — 47.7	2026-02-26
Gemini 2.5 Flash	16.1	11.2 — 21.2	2026-02-26

Latest results as of April 7th 2026: This refresh includes the addition of GPT-5.4 and GPT-5.3-Codex.
Check back periodically for updates!

Score is the average percentage of 100 test cases successfully resolved across 10 runs for each model.
Confidence Interval (CI) represents the expected performance range, reflecting the results' statistical reliability (p-value < 0.05).

Android Bench

Android LLM Leaderboard

Learn more about Android Bench

Our methodology

Android best practices

GitHub repo