Android Bench
Several benchmarks have emerged to measure how well LLMs perform at AI-assisted software engineering, but Android developers face specific challenges that existing benchmarks don't cover. We created Android Bench to fill that gap, with high-quality Android development as its north star.
Android LLM Leaderboard
| Column | Definition |
|---|---|
| Score (%) | Average percentage of 100 test cases successfully resolved across 10 runs for each model |
| CI range (%) | Expected performance range, reflecting the results' statistical reliability (p-value < 0.05) |
| Avg latency (h) | Average time taken to solve 100 tasks across 10 runs |
| Avg total tokens (M) | Average token consumption for a full benchmark run (100 tasks) across 10 runs |
| Avg cost ($) | Average cost per full benchmark run |

| Model | Score (%) | CI range (%) | Avg latency (h) | Avg total tokens (M) | Avg cost ($) | Date |
|---|---|---|---|---|---|---|
| | 74.0 | 66.8–80.5 | 15.5 | 64.5 | $133.9 | 2026-04-27 |
| | 72.4 | 65.4–79.3 | 21.2 | 64.2 | $91.7 | 2026-03-16 |
| | 72.4 | 65.1–78.8 | 11.5 | 75.4 | $49.0 | 2026-02-27 |
| | 68.7 | 60.5–75.9 | 11.6 | 90.0 | $124.3 | 2026-04-27 |
| | 67.7 | 59.9–75.6 | 11.2 | 71.4 | $42.6 | 2026-03-18 |
| | 66.6 | 59.1–74.1 | 9.9 | 69.5 | $84.4 | 2026-02-26 |
| | 62.5 | 54.4–70.0 | 24.3 | 124.4 | $121.9 | 2026-02-27 |
| | 61.9 | 53.9–70.2 | 12.5 | 79.8 | $102.5 | 2026-02-26 |
| | 60.4 | 52.3–67.7 | 9.8 | 117.0 | $63.7 | 2026-02-27 |
| | 59.7 | 52.4–67.4 | 33.4 | 80.2 | $46.7 | 2026-05-08 |
| | 58.4 | 50.3–66.4 | 8.2 | 47.9 | $40.4 | 2026-03-01 |
| | 58.6 | 51.3–66.5 | 29.9 | 94.3 | $42.5 | 2026-05-10 |
| | 55.4 | 47.5–63.6 | 35.8 | 132.7 | $13.7 | 2026-05-08 |
| | 54.2 | 45.9–62.2 | 13.1 | 92.9 | $60.3 | 2026-02-26 |
| | 52.7 | 45.3–60.7 | 28.1 | 164.7 | $8.4 | 2026-05-11 |
| | 52.0 | 43.8–60.0 | 33.1 | 97.5 | $74.5 | 2026-05-09 |
| | 51.4 | 43.5–59.3 | 20.5 | 103.0 | $222.4 | 2026-05-07 |
| | 42.0 | 36.6–47.3 | 16.5 | 148.0 | $34.2 | 2026-02-26 |
| | 37.2 | 30.3–44.9 | 20.3 | 128.3 | $10.1 | 2026-05-01 |
| | 37.4 | 30.5–44.5 | 20.7 | 112.3 | $64.6 | 2026-05-05 |
| | 33.2 | 26.2–40.8 | 14.2 | 29.5 | $2.5 | 2026-05-01 |
| | 31.7 | 24.4–39.0 | 12.5 | 113.4 | $10.7 | 2026-05-05 |
| | 29.1 | 22.3–36.1 | 8.4 | 37.9 | $35.8 | 2026-03-02 |
| | 25.1 | 18.8–31.8 | 21.4 | 77.2 | $3.3 | 2026-05-01 |
| | 18.9 | 13.1–25.1 | 25.9 | 122.7 | $7.6 | 2026-05-09 |
| | 15.9 | 10.7–21.1 | 4.9 | 108.8 | $11.2 | 2026-02-26 |
| | 15.5 | 10.1–20.9 | 16.6 | 181.4 | $15.6 | 2026-05-07 |
| | 2.4 | 1.2–3.9 | 3.8 | 12.0 | $0.2 | 2026-05-11 |
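
The leaderboard reports a score averaged over 10 runs of 100 tasks, together with an expected performance range, but this page does not say how that range is computed. As a rough illustration only, the sketch below assumes a percentile bootstrap over tasks, which is one common way to get such an interval. The function names, the resampling scheme, and the sample data are all hypothetical and are not taken from Android Bench.

```python
import random


def mean_score(task_pass_rates):
    """Average score in percent, given one pass rate (0.0 to 1.0) per task."""
    return 100.0 * sum(task_pass_rates) / len(task_pass_rates)


def bootstrap_ci(task_pass_rates, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean score, resampling tasks.

    Assumption: tasks are the unit of resampling. The benchmark's real
    procedure may differ.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(task_pass_rates)
    # Resample tasks with replacement and record the mean of each resample.
    means = sorted(
        mean_score(rng.choices(task_pass_rates, k=n))
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Hypothetical data: 100 tasks, each with the fraction of 10 runs that resolved it.
rates = [1.0] * 60 + [0.5] * 20 + [0.0] * 20

score = mean_score(rates)
low, high = bootstrap_ci(rates)
print(f"score = {score:.1f}%, interval = {low:.1f} to {high:.1f}")
```

Resampling whole tasks rather than individual attempts keeps the 10 runs of a task together, so the interval reflects how much the score depends on which tasks happen to be in the benchmark.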
Latest results as of May 18th 2026: This refresh includes open-weight models, adding new columns for latency, tokens, and cost.
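
The cost and score columns can be combined into a rough cost-efficiency figure. The sketch below derives a cost per resolved task from the published averages. This metric is not part of the leaderboard, and the function name is hypothetical. It assumes that the score is the share of the 100 tasks resolved in an average run and that the cost covers one full run.

```python
def cost_per_resolved_task(score_pct, avg_cost_usd, tasks=100):
    """Average cost of a full run divided by the average number of tasks resolved."""
    resolved = tasks * score_pct / 100.0
    if resolved <= 0:
        raise ValueError("score must be positive to compute a per-task cost")
    return avg_cost_usd / resolved


# (score %, avg cost $) pairs copied from three rows of the leaderboard.
rows = [(74.0, 133.9), (72.4, 49.0), (52.7, 8.4)]

for score, cost in rows:
    per_task = cost_per_resolved_task(score, cost)
    print(f"{score:.1f}% at ${cost:.1f} per run: ${per_task:.2f} per resolved task")
```

Read this way, two entries with similar scores can differ widely in what each resolved task costs, which the score column alone does not show.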
Check back periodically for updates!
Learn more about Android Bench
Our methodology
Learn more about how we created a set of common Android developer tasks.
Android best practices
Many of the tasks are based on our definition of high-quality Android development, which is detailed in our developer documentation.
GitHub repo
See the full repo so you can replicate the tests yourself.