davanstrien/ocr-bench-britannica-results-qwen35
Rankings are computed by fitting a Bradley-Terry model by maximum likelihood (MLE) to pairwise comparisons judged by a vision-language model. The judge sees the original document image alongside two anonymised OCR outputs and picks the more faithful transcription. Browse the comparisons to see the evidence, and vote yourself to build a Human ELO column. Human votes are stored locally for this session only and are reset when the server restarts.
| # | Model | Params | Judge ELO | 95% CI | Wins | Losses | Ties | Win% | Human ELO | H-Win% |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLM-OCR | 0.9B | 1716 | 1673–1769 | 182 | 60 | 2 | 75% | 1544 | 50% |
| 2 | LightOnOCR-2-1B | 1B | 1697 | 1655–1749 | 158 | 61 | 1 | 72% | 1456 | 25% |
| 3 | NuExtract3 | 4B | 1649 | 1604–1708 | 162 | 82 | 0 | 66% | — | — |
| 4 | FireRed-OCR | 2.1B | 1506 | 1469–1548 | 115 | 127 | 2 | 47% | — | — |
| 5 | DeepSeek-OCR | 4B | 1421 | 1374–1465 | 90 | 153 | 1 | 37% | — | — |
| 6 | dots.ocr | 1.7B | 1011 | 881–1090 | 10 | 234 | 0 | 4% | — | — |
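Smaller models can win on the right documents: the two top-ranked models here have about 1B parameters or fewer and place ahead of both 4B models. Error bars show 95% confidence intervals.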
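For readers curious how a Bradley-Terry MLE turns pairwise outcomes into ratings, here is a minimal sketch. It is not the code behind this leaderboard: the function name `bradley_terry_elo`, the 1500 base, the 400-point scale, the win floor for winless models, and the toy counts are all assumptions made for illustration. Ties are not modelled here; one common convention, also an assumption, is to count a tie as half a win for each side.

```python
import math


def bradley_terry_elo(wins, base=1500.0, scale=400.0, n_iter=1000, tol=1e-12, floor=1e-9):
    """Fit Bradley-Terry strengths by MLE and map them to an Elo-like scale.

    wins[(a, b)] is how often model a was preferred over model b.
    base, scale and floor are illustrative choices, not the leaderboard's settings.
    """
    models = sorted({m for pair in wins for m in pair})

    # Total wins per model.
    total = {m: 0.0 for m in models}
    for (a, _b), w in wins.items():
        total[a] += w
    # A model with zero wins has no finite MLE; the floor keeps its rating finite.
    total = {m: max(t, floor) for m, t in total.items()}

    strength = {m: 1.0 for m in models}
    for _ in range(n_iter):
        updated = {}
        for i in models:
            # MM update: p_i = W_i / sum_j n_ij / (p_i + p_j)
            denom = 0.0
            for j in models:
                if i == j:
                    continue
                n_ij = wins.get((i, j), 0) + wins.get((j, i), 0)
                denom += n_ij / (strength[i] + strength[j])
            updated[i] = total[i] / denom if denom > 0 else strength[i]

        # Strengths are only defined up to a constant factor, so fix the
        # geometric mean at 1 (the mean rating then equals `base`).
        log_mean = sum(math.log(s) for s in updated.values()) / len(models)
        updated = {m: s / math.exp(log_mean) for m, s in updated.items()}

        change = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if change < tol:
            break

    return {m: base + scale * math.log10(strength[m]) for m in models}


# Made-up counts for three hypothetical models, not taken from the table above.
toy = {
    ("A", "B"): 8, ("B", "A"): 2,
    ("B", "C"): 7, ("C", "B"): 3,
    ("A", "C"): 9, ("C", "A"): 1,
}
ratings = bradley_terry_elo(toy)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.0f}")
```

Under this scaling, a rating gap of d points corresponds to a predicted win probability of 1 / (1 + 10^(-d/400)) for the higher-rated model, so overlapping confidence intervals between neighbouring rows indicate a ranking that is not clearly settled.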