davanstrien/ocr-bench-britannica-results-qwen35
Rankings are computed with Bradley-Terry MLE from pairwise comparisons judged by a vision-language model. The judge sees the original document image alongside two anonymised OCR outputs and picks the more faithful transcription. Browse the comparisons to see the evidence, and vote yourself to build the Human Elo column. Human votes are stored locally for this session only and reset when the server restarts.
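To make the ranking procedure concrete, here is a minimal sketch of a Bradley-Terry MLE fit using the classic minorization-maximization update, reported on an Elo-like scale (400 * log10(strength), centred at 1500, the convention used by most arena-style leaderboards; the centring is an assumption about this one). The head-to-head counts are illustrative, not the benchmark's actual data, and ties are ignored here (a common convention is to credit each side with half a win).

```python
import math
from collections import defaultdict

# Hypothetical aggregated head-to-head counts, for illustration only;
# these are not the benchmark's actual comparison data.
comparisons = [
    # (model_a, model_b, wins_a, wins_b)
    ("GLM-OCR", "FireRed-OCR", 28, 10),
    ("GLM-OCR", "dots.ocr", 37, 1),
    ("FireRed-OCR", "dots.ocr", 34, 4),
]

models = sorted({m for a, b, _, _ in comparisons for m in (a, b)})

# Total wins per model and total games per unordered pair.
wins, games = defaultdict(float), defaultdict(float)
for a, b, wa, wb in comparisons:
    wins[a] += wa
    wins[b] += wb
    games[frozenset((a, b))] += wa + wb

# Bradley-Terry MLE via the classic minorization-maximization update:
#   p_i <- W_i / sum_{j != i} n_ij / (p_i + p_j)
p = {m: 1.0 for m in models}
for _ in range(200):
    new = {i: wins[i] / sum(games[frozenset((i, j))] / (p[i] + p[j])
                            for j in models if j != i and games[frozenset((i, j))])
           for i in models}
    total = sum(new.values())
    # The model is scale-invariant, so renormalise each iteration.
    p = {m: v * len(models) / total for m, v in new.items()}

# Report on an Elo-like scale: 400 * log10(strength), centred at 1500.
mean_log = sum(math.log10(v) for v in p.values()) / len(models)
for m in sorted(p, key=p.get, reverse=True):
    print(f"{m:12s} {1500 + 400 * (math.log10(p[m]) - mean_log):7.0f}")
```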
| # | Model | Params | Judge Elo | 95% CI | Wins | Losses | Ties | Win% | Human Elo | H-Win% |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | GLM-OCR | 0.9B | 1787 | 1727–1873 | 155 | 37 | 2 | 80% | — | — |
| 2 | LightOnOCR-2-1B | 1B | 1780 | 1727–1863 | 138 | 37 | 1 | 78% | — | — |
| 3 | FireRed-OCR | 2.1B | 1551 | 1502–1623 | 100 | 92 | 2 | 52% | 2617 | 100% |
| 4 | DeepSeek-OCR | 4B | 1437 | 1373–1507 | 75 | 118 | 1 | 39% | 383 | 0% |
| 5 | dots.ocr | 1.7B | 945 | 725–1045 | 5 | 189 | 0 | 3% | — | — |
Smaller models can win on the right documents. The 95% CI column gives a 95% confidence interval for each Judge Elo rating.
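The card does not say how those confidence intervals are computed; a common approach for Elo-style leaderboards is to bootstrap over the individual judged comparisons, refitting the ratings on each resample and taking the 2.5th and 97.5th percentiles. A self-contained sketch under that assumption (`fit_bt` and the game counts below are illustrative, not the benchmark's code or data):

```python
import math
import random
from collections import defaultdict

def fit_bt(judged, iters=100):
    """Elo-scaled Bradley-Terry fit over (winner, loser) records,
    via the standard minorization-maximization update."""
    models = sorted({m for g in judged for m in g})
    wins, n = defaultdict(float), defaultdict(float)
    for winner, loser in judged:
        wins[winner] += 1
        n[frozenset((winner, loser))] += 1
    p = {m: 1.0 for m in models}
    for _ in range(iters):
        new = {i: wins[i] / sum(n[frozenset((i, j))] / (p[i] + p[j])
                                for j in models if j != i and n[frozenset((i, j))])
               for i in models}
        s = sum(new.values())
        p = {m: v * len(models) / s for m, v in new.items()}
    # max() guards against zero-win models in unlucky resamples.
    mean = sum(math.log10(max(v, 1e-9)) for v in p.values()) / len(models)
    return {m: 1500 + 400 * (math.log10(max(v, 1e-9)) - mean)
            for m, v in p.items()}

# Illustrative judged games (winner, loser); not the real benchmark data.
base = ([("GLM-OCR", "FireRed-OCR")] * 28 + [("FireRed-OCR", "GLM-OCR")] * 10 +
        [("GLM-OCR", "dots.ocr")] * 37 + [("dots.ocr", "GLM-OCR")] * 1 +
        [("FireRed-OCR", "dots.ocr")] * 34 + [("dots.ocr", "FireRed-OCR")] * 4)

random.seed(0)
ratings = defaultdict(list)
for _ in range(200):  # bootstrap replicates
    resample = random.choices(base, k=len(base))
    for m, r in fit_bt(resample).items():
        ratings[m].append(r)

for m, rs in sorted(ratings.items()):
    rs.sort()
    lo, hi = rs[int(0.025 * len(rs))], rs[int(0.975 * len(rs))]
    print(f"{m:12s} 95% CI: {lo:.0f}-{hi:.0f}")
```

With few wins for a model (as with dots.ocr here), the resampled ratings spread widely, which is why the weakest entries in the table above carry the broadest intervals.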