Leaderboard

davanstrien/ocr-bench-britannica-results-qwen35

Rankings are computed by Bradley-Terry maximum-likelihood estimation (MLE) from pairwise comparisons judged by a vision-language model. The judge sees the original document image alongside two anonymised OCR outputs and picks the more faithful transcription. Browse the comparisons to see the evidence, and vote yourself to build a Human ELO column. Human votes are stored locally for this session only and reset when the server restarts.
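As a rough illustration of how pairwise verdicts can become ratings, here is a minimal Bradley-Terry sketch in Python. It is not the leaderboard's actual code. The function names `fit_bradley_terry` and `to_elo` are hypothetical, the fit uses the standard minorise-maximise (MM) iteration, ties are simply left out, and the Elo-style mapping (centre 1500, 400 points per factor of 10 in strength) is an assumed convention. The six Judge ELO values in the table do average exactly 1500, which is consistent with that kind of centring.

```python
import math
from collections import defaultdict


def fit_bradley_terry(comparisons, max_iters=1000, tol=1e-10):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs.

    Uses the MM update p_i <- wins_i / sum_j n_ij / (p_i + p_j).
    Assumes every model has at least one win and one loss; otherwise
    the maximum-likelihood estimate is not finite.
    """
    wins = defaultdict(float)   # total wins per model
    games = defaultdict(float)  # comparisons per unordered pair
    for winner, loser in comparisons:
        wins[winner] += 1.0
        games[frozenset((winner, loser))] += 1.0

    models = sorted({m for pair in games for m in pair})
    strength = {m: 1.0 for m in models}

    for _ in range(max_iters):
        updated = {}
        for i in models:
            denom = sum(
                games[frozenset((i, j))] / (strength[i] + strength[j])
                for j in models
                if j != i and frozenset((i, j)) in games
            )
            updated[i] = wins[i] / denom
        # Strengths are only defined up to a constant factor, so rescale.
        scale = len(models) / sum(updated.values())
        updated = {m: v * scale for m, v in updated.items()}
        delta = max(abs(updated[m] - strength[m]) for m in models)
        strength = updated
        if delta < tol:
            break
    return strength


def to_elo(strength, center=1500.0, scale=400.0):
    """Map strengths to an Elo-like scale (assumed convention, see above)."""
    if any(s <= 0 for s in strength.values()):
        raise ValueError("every model needs at least one win")
    logs = {m: math.log10(s) for m, s in strength.items()}
    mean_log = sum(logs.values()) / len(logs)
    return {m: center + scale * (logs[m] - mean_log) for m in logs}


# Toy data with made-up model names: each pair meets four times, 3-1.
toy = (
    [("A", "B")] * 3 + [("B", "A")]
    + [("B", "C")] * 3 + [("C", "B")]
    + [("A", "C")] * 3 + [("C", "A")]
)
toy_elo = to_elo(fit_bradley_terry(toy))
```

On the toy data the ratings come out ordered A, B, C and average 1500. With only two models and a 3-1 record, the fitted strength ratio is exactly 3, which maps to a gap of 400·log10(3) points under the assumed scale.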

#	Model	Params	Judge ELO	95% CI	Wins	Losses	Ties	Win%	Human ELO	H-Win%
1	GLM-OCR	0.9B	1716	1673–1769	182	60	2	75%	1544	50%
2	LightOnOCR-2-1B	1B	1697	1655–1749	158	61	1	72%	1456	25%
3	NuExtract3	4B	1649	1604–1708	162	82	0	66%	—	—
4	FireRed-OCR	2.1B	1506	1469–1548	115	127	2	47%	—	—
5	DeepSeek-OCR	4B	1421	1374–1465	90	153	1	37%	—	—
6	dots.ocr	1.7B	1011	881–1090	10	234	0	4%	—	—
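The Win% column appears to be wins divided by all judged comparisons (wins + losses + ties), rounded to the nearest percent. The short check below assumes that formula, which the page does not state, and compares it with the table. The helper name `win_pct` is hypothetical.

```python
# (wins, losses, ties) per model, copied from the table above.
results = {
    "GLM-OCR": (182, 60, 2),
    "LightOnOCR-2-1B": (158, 61, 1),
    "NuExtract3": (162, 82, 0),
    "FireRed-OCR": (115, 127, 2),
    "DeepSeek-OCR": (90, 153, 1),
    "dots.ocr": (10, 234, 0),
}


def win_pct(wins, losses, ties):
    """Share of comparisons won, as a whole-number percentage (assumed formula)."""
    return round(100 * wins / (wins + losses + ties))


computed = {model: win_pct(*record) for model, record in results.items()}
totals = {model: sum(record) for model, record in results.items()}
```

Under this assumption the computed values match the published column for all six models. The totals also show that LightOnOCR-2-1B has 220 judged comparisons, while every other model has 244.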

ELO vs Parameter Count

Smaller models can win on the right documents: here GLM-OCR (0.9B) and LightOnOCR-2-1B (1B) both outrank the two 4B models. Error bars show 95% confidence intervals.