Leaderboard

davanstrien/ocr-bench-britannica-results-qwen35

Rankings are computed with a Bradley-Terry maximum-likelihood estimate (MLE) fitted to pairwise comparisons judged by a vision-language model. The judge sees the original document image alongside two anonymised OCR outputs and picks the more faithful transcription. Browse the comparisons to see the evidence, and vote yourself to populate the Human ELO column. Human votes are stored locally for this session only and reset when the server restarts.
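As a rough illustration of how a Bradley-Terry fit can turn pairwise judgements into ratings, here is a minimal sketch using the standard minorise-maximise (MM) update. It is not the leaderboard's actual code. The function name `bradley_terry_elo`, the `(model_a, model_b, outcome)` input format, counting a tie as half a win for each side, and the 1500-centred, 400-points-per-factor-of-ten scale are all assumptions made for this example. (The Judge ELO values in the table do average 1500, which is consistent with this kind of centring.)

```python
import math
from collections import defaultdict


def bradley_terry_elo(comparisons, iters=200, base=1500.0, scale=400.0):
    """Fit Bradley-Terry strengths and map them to an Elo-like scale.

    comparisons: iterable of (model_a, model_b, outcome) tuples, where
    outcome is "a", "b" or "tie". Ties count as half a win for each side
    (an assumption; the source does not say how ties are handled).
    """
    wins = defaultdict(float)   # fractional wins per model
    games = defaultdict(float)  # comparisons per ordered pair (symmetric)
    models = set()
    for a, b, outcome in comparisons:
        models.update((a, b))
        games[(a, b)] += 1.0
        games[(b, a)] += 1.0
        if outcome == "a":
            wins[a] += 1.0
        elif outcome == "b":
            wins[b] += 1.0
        else:
            wins[a] += 0.5
            wins[b] += 0.5

    models = sorted(models)
    strength = {m: 1.0 for m in models}
    for _ in range(iters):
        updated = {}
        for i in models:
            denom = sum(
                games.get((i, j), 0.0) / (strength[i] + strength[j])
                for j in models
                if j != i
            )
            # Floor the numerator so a model with zero wins keeps a
            # positive strength and its logarithm stays finite.
            updated[i] = max(wins[i], 1e-9) / denom if denom > 0 else strength[i]
        # Strengths are only defined up to a constant factor, so fix the
        # geometric mean at 1 to keep the scale stable between iterations.
        log_mean = sum(math.log(v) for v in updated.values()) / len(updated)
        strength = {m: v / math.exp(log_mean) for m, v in updated.items()}

    return {m: base + scale * math.log10(strength[m]) for m in models}


# Toy data: A is preferred over B in three of four comparisons.
toy = [("A", "B", "a")] * 3 + [("A", "B", "b")]
ratings = bradley_terry_elo(toy)
```

With the toy data the fitted strength ratio is 3:1, so the two ratings sit symmetrically around 1500 and differ by 400·log10(3), roughly 191 points.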

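| # | Model | Params | Judge ELO | 95% CI | Wins | Losses | Ties | Win% | Human ELO | H-Win% |
|---|-------|--------|-----------|--------|------|--------|------|------|-----------|--------|
| 1 | GLM-OCR | 0.9B | 1716 | 1673–1769 | 182 | 60 | 2 | 75% | 1544 | 50% |
| 2 | LightOnOCR-2-1B | 1B | 1697 | 1655–1749 | 158 | 61 | 1 | 72% | 1456 | 25% |
| 3 | NuExtract3 | 4B | 1649 | 1604–1708 | 162 | 82 | 0 | 66% | | |
| 4 | FireRed-OCR | 2.1B | 1506 | 1469–1548 | 115 | 127 | 2 | 47% | | |
| 5 | DeepSeek-OCR | 4B | 1421 | 1374–1465 | 90 | 153 | 1 | 37% | | |
| 6 | dots.ocr | 1.7B | 1011 | 881–1090 | 10 | 234 | 0 | 4% | | |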
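The Win% column appears to be wins divided by all comparisons, with ties counted in the denominator. That reading is inferred from the numbers in the table and is not stated by the source. The short check below, with a hypothetical helper `win_pct`, reproduces every value in the column under that assumption.

```python
def win_pct(wins, losses, ties):
    """Share of all comparisons won, as a whole-number percentage."""
    total = wins + losses + ties
    return round(100 * wins / total)


# (wins, losses, ties) copied from the leaderboard rows above.
rows = {
    "GLM-OCR": (182, 60, 2),
    "LightOnOCR-2-1B": (158, 61, 1),
    "NuExtract3": (162, 82, 0),
    "FireRed-OCR": (115, 127, 2),
    "DeepSeek-OCR": (90, 153, 1),
    "dots.ocr": (10, 234, 0),
}

for model, record in rows.items():
    print(f"{model}: {win_pct(*record)}%")
```

Dropping ties from the denominator would give FireRed-OCR 48% rather than the listed 47%, which is why ties are kept in the total here.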

ELO vs Parameter Count

Smaller models can win on the right documents. Error bars show 95% confidence intervals.
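The source does not say how the 95% intervals are computed. One common approach for pairwise-comparison leaderboards is a percentile bootstrap: resample the comparisons with replacement, recompute the statistic each time, and take the 2.5th and 97.5th percentiles. The sketch below assumes that approach. For brevity it uses a win rate as the statistic rather than a full rating fit, and the names `bootstrap_ci` and `win_rate` are hypothetical.

```python
import random


def bootstrap_ci(samples, statistic, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for statistic(samples)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(samples)
    stats = sorted(
        statistic([samples[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return stats[lo_idx], stats[hi_idx]


def win_rate(outcomes):
    """Fraction of outcomes that are wins (ties and losses both count against)."""
    return sum(1 for o in outcomes if o == "win") / len(outcomes)


# GLM-OCR's record from the table: 182 wins, 60 losses, 2 ties.
outcomes = ["win"] * 182 + ["loss"] * 60 + ["tie"] * 2
low, high = bootstrap_ci(outcomes, win_rate)
print(f"win rate 95% interval: {low:.3f} to {high:.3f}")
```

Passing a function that fits ratings on the resampled comparisons as `statistic` would give intervals on the rating scale instead of the win-rate scale.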