davanstrien/ocr-bench-britannica-results-qwen35
Rankings are computed with Bradley-Terry MLE from pairwise comparisons judged by a vision-language model. The judge sees the original document image alongside two anonymised OCR outputs and picks the more faithful transcription. Browse the comparisons to see the evidence, and vote yourself to build the Human Elo column. Human votes are stored locally for this session only and reset when the server restarts.
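To make the ranking procedure concrete, here is a minimal sketch of a Bradley-Terry MLE fit using the classic minorization-maximization update, reported on an Elo-like scale (400 * log10(strength), centred at 1500, the convention used by most arena-style leaderboards; the centring is an assumption about this one). The head-to-head counts are illustrative, not the benchmark's actual data, and ties are ignored here (a common convention is to credit each side with half a win).

```python
import math
from collections import defaultdict

# Hypothetical aggregated head-to-head counts, for illustration only;
# these are not the benchmark's actual comparison data.
comparisons = [
    # (model_a, model_b, wins_a, wins_b)
    ("GLM-OCR", "FireRed-OCR", 28, 10),
    ("GLM-OCR", "dots.ocr", 37, 1),
    ("FireRed-OCR", "dots.ocr", 34, 4),
]

models = sorted({m for a, b, _, _ in comparisons for m in (a, b)})

# Total wins per model and total games per unordered pair.
wins, games = defaultdict(float), defaultdict(float)
for a, b, wa, wb in comparisons:
    wins[a] += wa
    wins[b] += wb
    games[frozenset((a, b))] += wa + wb

# Bradley-Terry MLE via the classic minorization-maximization update:
#   p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j)
p = {m: 1.0 for m in models}
for _ in range(200):
    new = {i: wins[i] / sum(games[frozenset((i, j))] / (p[i] + p[j])
                            for j in models if j != i and games[frozenset((i, j))])
           for i in models}
    total = sum(new.values())
    # The model is scale-invariant, so renormalise each iteration.
    p = {m: v * len(models) / total for m, v in new.items()}

# Report on an Elo-like scale: 400 * log10(strength), centred at 1500.
mean_log = sum(math.log10(v) for v in p.values()) / len(models)
for m in sorted(p, key=p.get, reverse=True):
    print(f"{m:12s} {1500 + 400 * (math.log10(p[m]) - mean_log):7.0f}")
```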
| # | Model | Params | Judge Elo | 95% CI | Wins | Losses | Ties | Win% | Human Elo | H-Win% |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLM-OCR | 0.9B | 1787 | 1727–1873 | 155 | 37 | 2 | 80% | — | — |
| 2 | LightOnOCR-2-1B | 1B | 1780 | 1727–1863 | 138 | 37 | 1 | 78% | — | — |
| 3 | FireRed-OCR | 2.1B | 1551 | 1502–1623 | 100 | 92 | 2 | 52% | 2617 | 100% |
| 4 | DeepSeek-OCR | 4B | 1437 | 1373–1507 | 75 | 118 | 1 | 39% | 383 | 0% |
| 5 | dots.ocr | 1.7B | 945 | 725–1045 | 5 | 189 | 0 | 3% | — | — |
Smaller models can win on the right documents. The 95% CI column gives a 95% confidence interval for each Judge Elo rating.
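The card does not say how those confidence intervals are computed; a common approach for Elo-style leaderboards is to bootstrap over the individual judged comparisons, refitting the ratings on each resample and taking the 2.5th and 97.5th percentiles. A self-contained sketch under that assumption (`fit_bt` and the game counts below are illustrative, not the benchmark's code or data):

```python
import math
import random
from collections import defaultdict

def fit_bt(judged, iters=100):
    """Elo-scaled Bradley-Terry fit over (winner, loser) records,
    via the standard minorization-maximization update."""
    models = sorted({m for g in judged for m in g})
    wins, n = defaultdict(float), defaultdict(float)
    for winner, loser in judged:
        wins[winner] += 1
        n[frozenset((winner, loser))] += 1
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {i: wins[i] / sum(n[frozenset((i, j))] / (p[i] + p[j])
                                for j in models if j != i and n[frozenset((i, j))])
               for i in models}
        s = sum(new.values())
        p = {m: v * len(models) / s for m, v in new.items()}
    # max() guards against zero-win models in unlucky resamples.
    mean = sum(math.log10(max(v, 1e-9)) for v in p.values()) / len(models)
    return {m: 1500 + 400 * (math.log10(max(v, 1e-9)) - mean)
            for m, v in p.items()}

# Illustrative judged games (winner, loser); not the real benchmark data.
base = ([("GLM-OCR", "FireRed-OCR")] * 28 + [("FireRed-OCR", "GLM-OCR")] * 10 +
        [("GLM-OCR", "dots.ocr")] * 37 + [("dots.ocr", "GLM-OCR")] * 1 +
        [("FireRed-OCR", "dots.ocr")] * 34 + [("dots.ocr", "FireRed-OCR")] * 4)

random.seed(0)
ratings = defaultdict(list)
for _ in range(200):  # bootstrap replicates
    resample = random.choices(base, k=len(base))
    for m, r in fit_bt(resample).items():
        ratings[m].append(r)

for m, rs in sorted(ratings.items()):
    rs.sort()
    lo, hi = rs[int(0.025 * len(rs))], rs[int(0.975 * len(rs))]
    print(f"{m:12s} 95% CI: {lo:.0f}-{hi:.0f}")
```

With few wins for a model (as with dots.ocr here), the resampled ratings spread widely, which is why the weakest entries in the table above carry the broadest intervals.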