I’ve also heard very good things about these two in particular:
- LightOnOCR-2-1B: https://huggingface.co/lightonai/LightOnOCR-2-1B
- PaddleOCR-VL-1.5: https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5
The OCR leaderboards I’ve seen leave a lot to be desired.
With the rapid release of so many of these models, I wish there were a better way to know which ones are actually the best.
I also feel like most/all of these models don’t handle charts, other than to maybe include a link to a cropped image. It would be nice for the OCR model to also convert charts into markdown tables, but this is obviously challenging.
This project has been pretty easy to build with agentic coding. It's a Frankenstein monster of glue code and handling my particular domain requirements, so it's not suitable for public release. I'd encourage some rapid prototyping after you've spent an afternoon catching up on what's new. I did a lot of document OCR and post-processing with commercial tools and custom code 15 years ago. The advent of small local VLMs has made it practical to achieve higher accuracy and more domain customization than I would have previously believed.
[1] If you're building an advanced document processing workflow, be sure to read the post-processing code in the GLM code repo. They're doing some non-trivial logic to fuse layout areas and transform text for smooth reading. You probably want to store the raw model results and customize your own post-processing for uncommon languages or uncommon domain vocabulary. Layout is also easier to validate if you bypass their post-processing; it can make some combined areas "disappear" from the layout data.
EDIT: https://github.com/overcuriousity/pdf2epub looks interesting.
And here's the kicker. I can't afford mistakes. Missing a single character or misinterpreting it could be catastrophic. 4 units vacant? 10 days to respond? Signature missing? Incredibly critical things. I can't find an eval that gives me confidence around this.
I remember that one topping the scoreboards for many years, and it's usually the one I grab for OCR needs because of its reputation.
Also, do you have preferred OCR models in your experience? I've had some success with dots.OCR, but I'm only beginning to need to work with OCR.
But, as others said, if you can't afford mistakes, then you're going to need a human in the loop to take responsibility.
I've thought of open sourcing the wrapper but haven't gotten around to it yet. I bet Claude Code could build a functioning prototype if you just point it to the "screen_ai" dir under Chrome's user data.
The new models are similarly better compared to Tesseract v4. But don't expect the new models to be a panacea for your OCR problems. The edge-case problems you might be trying to solve (like identifying anchor points, or identifying shared field names across documents) are still pretty much all there. So you should still expect things like random spaces or unexpected characters to jam things up.
Also, some newer models tend to hallucinate incredibly aggressively. If you've ever seen an LLM get stuck in an infinite loop, think of that.
I think a more accurate reflection of the current state of these models would come from a real-world benchmark with messy/complex docs across industries and languages.
Not for OCR.
Regardless of how much some people complain about them, I really do appreciate the effort Artificial Analysis puts into consistently running standardized benchmarks for LLMs, rather than just aggregating unverified claims from the AI labs.
I don't think LMArena is that amazing at this point in time, but at least they provide error bars on the ELO and give models the same rank number when they're overlapping.
> Also, do you have preferred OCR models in your experience?
It's a subject I'm interested in, but I don't have enough experience to really put out strong opinions on specific models.
It also doesn't provide error bars on the ELO, so models that only have tens of battles are being listed alongside models that have thousands of battles with no indication of how confident those ELOs are, which I find rather unhelpful.
A lot of these models are also sensitive to how they are used, and offer multiple ways to be used. It's not clear how they are being invoked.
That leaderboard is definitely one of the ones that leaves a lot to be desired.
I can feed it a multi-page PDF and tell it to convert it to markdown, and it does this well. I don't need to load the pages one at a time as long as I use the PDF format. (This was tested in AI Studio, but I think the API works the same way.)
Any idea what model is being used?
How fast was it per page? Do you recall if it's CPU or GPU based? TY!
Isn’t this close to the error rate of human transcription for messy input, though? I seem to remember a figure in that ballpark. I think if your use case is this sensitive, then any transcription is suspicious.
How many pages did you try in a single request? 5? 50? 500?
I fully believe that 5 pages of input works just fine, but this does not scale up to larger documents, and the goal of OCR is usually to know what is actually written on the page... not what "should" have been written on the page. I think a larger number of pages makes it more likely for the LLM to hallucinate as it tries to "correct" errors that it sees, which is not the task. If that is a desirable task, I think it would be better to post-process the document with an LLM after it is converted to text, rather than asking the LLM to both read a large number of images and correct things at the same time, which is asking a lot.
Once the document gets long enough, current LLMs will get lazy and stop providing complete OCR for every page in their response.
One page at a time keeps the LLM focused on the task, and it's easy to parallelize so entire documents can be OCR'd quickly.
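Concretely, the per-page loop is just something like this (ocr_page here is a placeholder for whatever model or API call you use, and pdf2image is only one way to rasterize the pages):

```python
# Sketch of per-page, parallel OCR. `ocr_page` is a stand-in for your model call.
from concurrent.futures import ThreadPoolExecutor

from pdf2image import convert_from_path  # needs poppler installed

def ocr_page(image):
    # Call your OCR model/endpoint on one page image and return its text.
    raise NotImplementedError

pages = convert_from_path("document.pdf", dpi=300)  # one PIL image per page
with ThreadPoolExecutor(max_workers=8) as pool:
    texts = list(pool.map(ocr_page, pages))  # map() preserves page order

markdown = "\n\n".join(texts)
```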
I never tested Gemini 3 PDF OCR against individual images, but I can say it processes a small 6-page PDF better than the retired Gemini 1.5 or 2 did with individual images.
I agree that OCR and analysis should be two separate steps.
My documents have one- or two-column layouts, often inconsistently across pages or even within a page (which tripped up older layout-detection methods). Most models seem to understand that well enough, so they are good enough for my use case.
Also, there are generalist models with a good enough grasp of a dozen or so languages that still fit comfortably in 7B parameters. Like the older Mistral, which had the best multilingual support at the time; newer models around that size are probably good candidates too. I am not surprised that a multilingual specialised model can fit in 8B or so.
GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization. The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts.
Key Features

- **State-of-the-Art Performance:** Achieves a score of 94.62 on OmniDocBench V1.5, ranking #1 overall, and delivers state-of-the-art results across major document understanding benchmarks, including formula recognition, table recognition, and information extraction.
- **Optimized for Real-World Scenarios:** Designed and optimized for practical business use cases, maintaining robust performance on complex tables, code-heavy documents, seals, and other challenging real-world layouts.
- **Efficient Inference:** With only 0.9B parameters, GLM-OCR supports deployment via vLLM, SGLang, and Ollama, significantly reducing inference latency and compute cost, making it ideal for high-concurrency services and edge deployments.
- **Easy to Use:** Fully open-sourced and equipped with a comprehensive SDK and inference toolchain, offering simple installation, one-line invocation, and smooth integration into existing production pipelines.
| Model | Download Links | Precision |
|---|---|---|
| GLM-OCR | 🤗 Hugging Face 🤖 ModelScope | BF16 |
We provide an SDK for using GLM-OCR more efficiently and conveniently.
# Install from source
git clone https://github.com/zai-org/glm-ocr.git
cd glm-ocr
uv venv --python 3.12 --seed && source .venv/bin/activate
uv pip install -e .
Two ways to use GLM-OCR:
Use the hosted cloud API – no GPU needed. The cloud service runs the complete GLM-OCR pipeline internally, so the SDK simply forwards your request and returns the result.
config.yaml:

pipeline:
  maas:
    enabled: true          # Enable MaaS mode
    api_key: your-api-key  # Required
That's it! When maas.enabled=true, the SDK acts as a thin wrapper that forwards your request to the cloud service and returns the result.
Input note (MaaS): the upstream API accepts file as a URL or a data:<mime>;base64,... data URI. If you have raw base64 without the data: prefix, wrap it as a data URI (recommended); the SDK will auto-wrap local file paths, bytes, and raw base64 into a data URI when calling MaaS.
API documentation: https://docs.bigmodel.cn/cn/guide/models/vlm/glm-ocr
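For example, a minimal sketch of building the data URI yourself (whether parse() accepts a data-URI string directly, the way it accepts an http(s) URL, is an assumption here; the SDK will in any case auto-wrap paths, bytes, and raw base64 for you):

```python
import base64

from glmocr import parse

# Build a data URI from raw base64 (assumes a PNG; adjust the MIME type to your file).
with open("page.png", "rb") as f:
    raw_b64 = base64.b64encode(f.read()).decode("ascii")
data_uri = f"data:image/png;base64,{raw_b64}"

# Assumption: parse() accepts a data-URI string the same way it accepts an
# http(s) URL; otherwise pass the file path and let the SDK auto-wrap it.
result = parse(data_uri)
result.save(output_dir="./results")
```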
Deploy the GLM-OCR model locally for full control. The SDK provides the complete pipeline: layout detection, parallel region OCR, and result formatting.
Install vLLM:
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
# Or use Docker
docker pull vllm/vllm-openai:nightly
Launch the service:
# In a Docker container, uv may not be needed for the transformers install
uv pip install git+https://github.com/huggingface/transformers.git
# Run with MTP for better performance
vllm serve zai-org/GLM-OCR --allowed-local-media-path / --port 8080 --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' --served-model-name glm-ocr
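Once the server is running, it exposes vLLM's OpenAI-compatible API. A rough smoke-test sketch (the prompt text and response handling here are assumptions; in normal use the SDK builds the request for you):

```python
import base64

import requests

# Send one page image to the vLLM server's OpenAI-compatible endpoint.
# "glm-ocr" matches --served-model-name above; the prompt is an assumption.
with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "glm-ocr",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "OCR this page."},
        ],
    }],
    "max_tokens": 16384,
}
resp = requests.post("http://localhost:8080/v1/chat/completions",
                     json=payload, timeout=300)
print(resp.json()["choices"][0]["message"]["content"])
```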
Install SGLang:
docker pull lmsysorg/sglang:dev
# Or build from source
uv pip install git+https://github.com/sgl-project/sglang.git#subdirectory=python
Launch the service:
# In a Docker container, uv may not be needed for the transformers install
uv pip install git+https://github.com/huggingface/transformers.git
# Run with MTP for better performance
python -m sglang.launch_server --model zai-org/GLM-OCR --port 8080 --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --served-model-name glm-ocr
# Adjust the speculative config based on your device
After launching the service, configure config.yaml:
pipeline:
maas:
enabled: false # Disable MaaS mode (default)
ocr_api:
api_host: localhost # or your vLLM/SGLang server address
api_port: 8080
For specialized deployment scenarios, see the detailed deployment guides in the repository.
# Parse a single image
glmocr parse examples/source/code.png
# Parse a directory
glmocr parse examples/source/
# Set output directory
glmocr parse examples/source/code.png --output ./results/
# Use a custom config
glmocr parse examples/source/code.png --config my_config.yaml
# Enable debug logging with profiling
glmocr parse examples/source/code.png --log-level DEBUG
from glmocr import GlmOcr, parse
# Simple function
result = parse("image.png")
result = parse(["img1.png", "img2.jpg"])
result = parse("https://example.com/image.png")
result.save(output_dir="./results")
# Note: a list is treated as pages of a single document.
# Class-based API
with GlmOcr() as parser:
result = parser.parse("image.png")
print(result.json_result)
result.save()
# Start service
python -m glmocr.server
# With debug logging
python -m glmocr.server --log-level DEBUG
# Call API
curl -X POST http://localhost:5002/glmocr/parse \
-H "Content-Type: application/json" \
-d '{"images": ["./example/source/code.png"]}'
Semantics: images can be a string or a list.

Full configuration in glmocr/config.yaml:
# Server (for glmocr.server)
server:
host: "0.0.0.0"
port: 5002
debug: false
# Logging
logging:
level: INFO # DEBUG enables profiling
# Pipeline
pipeline:
# OCR API connection
ocr_api:
api_host: localhost
api_port: 8080
api_key: null # or set API_KEY env var
connect_timeout: 300
request_timeout: 300
# Page loader settings
page_loader:
max_tokens: 16384
temperature: 0.01
image_format: JPEG
min_pixels: 12544
max_pixels: 71372800
# Result formatting
result_formatter:
output_format: both # json, markdown, or both
# Layout detection (optional)
enable_layout: false
See config.yaml for all options.
Here are two examples of output formats:
[[{ "index": 0, "label": "text", "content": "...", "bbox_2d": null }]]
# Document Title
Body...
| Table | Content |
| ----- | ------- |
| ... | ... |
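A small sketch of consuming the JSON result (it assumes the nested structure shown above, where the outer list is pages and each block has index, label, content, and a possibly-null bbox_2d; the exact path depends on where you saved results):

```python
import json

# Walk the JSON result: outer list = pages, inner list = blocks per page.
with open("results/result.json") as f:
    pages = json.load(f)

for page_idx, blocks in enumerate(pages):
    for block in sorted(blocks, key=lambda b: b["index"]):
        snippet = block["content"][:80]
        print(f"page {page_idx} [{block['label']}]: {snippet}")
```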
You can run the example code like this:
python examples/example.py
Output structure (one folder per input):
- result.json – structured OCR result
- result.md – Markdown result
- imgs/ – cropped image regions (when layout mode is enabled)

GLM-OCR uses composable modules for easy customization:
| Component | Description |
|---|---|
| `PageLoader` | Preprocessing and image encoding |
| `OCRClient` | Calls the GLM-OCR model service |
| `PPDocLayoutDetector` | PP-DocLayout layout detection |
| `ResultFormatter` | Post-processing, outputs JSON/Markdown |
You can extend the behavior by creating custom pipelines:
from glmocr.dataloader import PageLoader
from glmocr.ocr_client import OCRClient
from glmocr.postprocess import ResultFormatter
class MyPipeline:
def __init__(self, config):
self.page_loader = PageLoader(config)
self.ocr_client = OCRClient(config)
self.formatter = ResultFormatter(config)
def process(self, request_data):
# Implement your own processing logic
pass
This project is inspired by the excellent work of related open-source projects and communities.
The code of this repo is released under the Apache License 2.0.
The GLM-OCR model is released under the MIT License.
The complete OCR pipeline integrates PP-DocLayoutV3 for document layout analysis, which is licensed under the Apache License 2.0. Users should comply with both licenses when using this project.
I am reminded it's basically impossible to read cursive writing in a language you don't know even if it's the same alphabet.