Leaderboard

Results dataset: davanstrien/ocr-bench-britannica-results-qwen35

Rankings are computed with Bradley-Terry maximum-likelihood estimation (MLE) over pairwise comparisons judged by a vision-language model. The judge sees the original document image alongside two anonymised OCR outputs and picks the more faithful transcription. Browse the comparisons to see the evidence, and vote yourself to populate the Human ELO column. Human votes are stored locally for this session only and reset when the server restarts.
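For intuition, here is a minimal Python sketch of how such a fit can work. It uses the standard MM (minorization-maximization) iteration for Bradley-Terry and maps the fitted strengths onto an Elo-like scale. The function name, the 1500-point anchor, and the half-win handling of ties are illustrative assumptions, not necessarily what this leaderboard runs.

```python
import math
from collections import defaultdict

def bradley_terry_elo(records, iters=500, tol=1e-10):
    """Fit Bradley-Terry strengths with the classic MM iteration and map
    them to an Elo-like scale. `records` is a list of
    (model_a, model_b, score_a) tuples, where score_a is 1.0 for an
    A win, 0.0 for a B win, and 0.5 for a tie."""
    wins = defaultdict(float)   # (fractional) win total per model
    games = defaultdict(float)  # games per ordered pair, stored both ways
    for a, b, score_a in records:
        wins[a] += score_a
        wins[b] += 1.0 - score_a
        games[(a, b)] += 1.0
        games[(b, a)] += 1.0
    models = sorted(wins)
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new_p = {}
        for i in models:
            denom = sum(games[(i, j)] / (p[i] + p[j])
                        for j in models if games[(i, j)] > 0)
            # tiny floor keeps a model that never won from collapsing to 0
            new_p[i] = max(wins[i], 1e-6) / denom if denom else p[i]
        # strengths are only identified up to a scale factor:
        # renormalize so the mean log-strength is 0
        mean_log = sum(math.log(v) for v in new_p.values()) / len(new_p)
        new_p = {m: v * math.exp(-mean_log) for m, v in new_p.items()}
        converged = max(abs(new_p[m] - p[m]) for m in models) < tol
        p = new_p
        if converged:
            break
    # Elo-like scale: 400 rating points per factor of 10 in strength,
    # anchored so the average model sits at 1500
    return {m: 1500.0 + 400.0 * math.log10(v) for m, v in p.items()}
```

Feeding the judged pairs from this benchmark into such a fit would yield ratings on the same kind of scale as the Judge ELO column, up to the choice of anchor.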

| # | Model | Params | Judge ELO | 95% CI | Wins | Losses | Ties | Win% | Human ELO | H-Win% |
|---|-------|--------|-----------|--------|------|--------|------|------|-----------|--------|
| 1 | GLM-OCR | 0.9B | 1787 | 1727–1873 | 155 | 37 | 2 | 80% | | |
| 2 | LightOnOCR-2-1B | 1B | 1780 | 1727–1863 | 138 | 37 | 1 | 78% | | |
| 3 | FireRed-OCR | 2.1B | 1551 | 1502–1623 | 100 | 92 | 2 | 52% | 2617 | 100% |
| 4 | DeepSeek-OCR | 4B | 1437 | 1373–1507 | 75 | 118 | 1 | 39% | 383 | 0% |
| 5 | dots.ocr | 1.7B | 945 | 725–1045 | 5 | 189 | 0 | 3% | | |

ELO vs Parameter Count

Smaller models can win on the right documents. Error bars show 95% confidence intervals.
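One common way to get intervals like the 95% CI column and these error bars is a percentile bootstrap over the judged pairs: resample the comparisons with replacement, refit, and read off the 2.5th and 97.5th percentiles. The source does not say this leaderboard does exactly that, so treat the sketch below as illustrative; it reuses `bradley_terry_elo` from above, and the resample count and seed are arbitrary.

```python
import random
from collections import defaultdict

def bootstrap_elo_ci(records, n_boot=200, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence intervals for the Elo-scale
    ratings: resample the judged pairs with replacement and refit."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(n_boot):
        resample = [records[rng.randrange(len(records))]
                    for _ in range(len(records))]
        for model, elo in bradley_terry_elo(resample).items():
            samples[model].append(elo)
    intervals = {}
    for model, elos in samples.items():
        elos.sort()
        lo = elos[int(len(elos) * alpha / 2)]
        hi = elos[int(len(elos) * (1 - alpha / 2)) - 1]
        intervals[model] = (round(lo), round(hi))
    return intervals
```

Wider intervals, like dots.ocr's 725–1045, simply reflect that a model with very few wins gives the fit little information to pin its strength down.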