I’ve also heard very good things about these two in particular:
- LightOnOCR-2-1B: https://huggingface.co/lightonai/LightOnOCR-2-1B
- PaddleOCR-VL-1.5: https://huggingface.co/PaddlePaddle/PaddleOCR-VL-1.5
The OCR leaderboards I’ve seen leave a lot to be desired.
With the rapid release of so many of these models, I wish there were a better way to know which ones are actually the best.
I also feel like most/all of these models don’t handle charts, other than to maybe include a link to a cropped image. It would be nice for the OCR model to also convert charts into markdown tables, but this is obviously challenging.
This project has been pretty easy to build with agentic coding. It's a Frankenstein monster of glue code and handling my particular domain requirements, so it's not suitable for public release. I'd encourage some rapid prototyping after you've spent an afternoon catching up on what's new. I did a lot of document OCR and post-processing with commercial tools and custom code 15 years ago. The advent of small local VLMs has made it practical to achieve higher accuracy and more domain customization than I would have previously believed.
[1] If you're building an advanced document processing workflow, be sure to read the post-processing code in the GLM code repo. They're doing some non-trivial logic to fuse layout areas and transform text for smooth reading. You probably want to store the raw model results and customize your own post-processing for uncommon languages or uncommon domain vocabulary. Layout is also easier to validate if you bypass their post-processing; it can make some combined areas "disappear" from the layout data.
EDIT: https://github.com/overcuriousity/pdf2epub looks interesting.
And here's the kicker. I can't afford mistakes. Missing a single character or misinterpreting it could be catastrophic. 4 units vacant? 10 days to respond? Signature missing? Incredibly critical things. I can't find an eval that gives me confidence around this.
I remember that one topping the scoreboards for many years, and it's usually the one I grab for OCR needs because of its reputation.
Also, do you have preferred OCR models in your experience? I've had some success with dots.OCR, but I'm only beginning to need to work with OCR.
But, as others said, if you can't afford mistakes, then you're going to need a human in the loop to take responsibility.
I've thought of open sourcing the wrapper but haven't gotten around to it yet. I bet Claude Code could build a functioning prototype if you just point it to the "screen_ai" dir under Chrome's user data.
The new models are similarly better compared to Tesseract v4. But don't expect the new models to be a panacea for your OCR problems. The edge-case problems you might be trying to solve (like identifying anchor points, or identifying shared field names across documents) are still pretty much all there. So you should still expect things like random spaces or unexpected characters to jam things up.
Also, some newer models tend to hallucinate incredibly aggressively. If you've ever seen an LLM get stuck in an infinite loop, think of that.
I think a more accurate reflection of the current state of these models would come from a real-world benchmark with messy/complex docs across industries and languages.
Not for OCR.
Regardless of how much some people complain about them, I really do appreciate the effort Artificial Analysis puts into consistently running standardized benchmarks for LLMs, rather than just aggregating unverified claims from the AI labs.
I don't think LMArena is that amazing at this point in time, but at least they provide error bars on the ELO and give models the same rank number when they're overlapping.
> Also, do you have preferred OCR models in your experience?
It's a subject I'm interested in, but I don't have enough experience to really put out strong opinions on specific models.
It also doesn't provide error bars on the ELO, so models that only have tens of battles are being listed alongside models that have thousands of battles with no indication of how confident those ELOs are, which I find rather unhelpful.
A lot of these models are also sensitive to how they are used, and offer multiple ways to be used. It's not clear how they are being invoked.
That leaderboard is definitely one of the ones that leaves a lot to be desired.
I can feed it a multi-page PDF and tell it to convert it to markdown, and it does this well. I don't need to load the pages one at a time as long as I use the PDF format. (This was tested in AI Studio, but I think the API works the same way.)
Any idea what model is being used?
How fast was it per page? Do you recall if it's CPU or GPU based? TY!
Isn’t this close to the error rate of human transcription for messy input, though? I seem to remember a figure in that ballpark. I think if your use case is this sensitive, then any transcription is suspicious.
How many pages did you try in a single request? 5? 50? 500?
I fully believe that 5 pages of input works just fine, but this does not scale up to larger documents, and the goal of OCR is usually to know what is actually written on the page... not what "should" have been written on the page. I think a larger number of pages makes it more likely for the LLM to hallucinate as it tries to "correct" errors that it sees, which is not the task. If that is a desirable task, I think it would be better to post-process the document with an LLM after it is converted to text, rather than asking the LLM to both read a large number of images and correct things at the same time, which is asking a lot.
Once the document gets long enough, current LLMs will get lazy and stop providing complete OCR for every page in their response.
One page at a time keeps the LLM focused on the task, and it's easy to parallelize so entire documents can be OCR'd quickly.
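Concretely, the per-page loop is just something like this (ocr_page here is a placeholder for whatever model or API call you use, and pdf2image is only one way to rasterize the pages):

```python
# Sketch of per-page, parallel OCR. `ocr_page` is a stand-in for your model call.
from concurrent.futures import ThreadPoolExecutor

from pdf2image import convert_from_path  # needs poppler installed

def ocr_page(image):
    # Call your OCR model/endpoint on one page image and return its text.
    raise NotImplementedError

pages = convert_from_path("document.pdf", dpi=300)  # one PIL image per page
with ThreadPoolExecutor(max_workers=8) as pool:
    texts = list(pool.map(ocr_page, pages))  # map() preserves page order

markdown = "\n\n".join(texts)
```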
I never tested Gemini 3 PDF OCR against individual images, but I can say it processes a small 6-page PDF better than the retired Gemini 1.5 or 2 did with individual images.
I agree that OCR and analysis should be two separate steps.
My documents have one- or two-column layouts, often inconsistently across pages or even within a page (which tripped up older layout-detection methods). Most models seem to understand that well enough, so they are good enough for my use case.
Also, there are generalist models with a good enough grasp of a dozen or so languages that still fit comfortably in 7B parameters. Like the older Mistral, which had the best multilingual support at the time; newer models around that size are probably good candidates too. I am not surprised that a multilingual specialised model can fit in 8B or so.
GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization. The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts.
Key Features

- **State-of-the-Art Performance:** Achieves a score of 94.62 on OmniDocBench V1.5, ranking #1 overall, and delivers state-of-the-art results across major document understanding benchmarks, including formula recognition, table recognition, and information extraction.
- **Optimized for Real-World Scenarios:** Designed and optimized for practical business use cases, maintaining robust performance on complex tables, code-heavy documents, seals, and other challenging real-world layouts.
- **Efficient Inference:** With only 0.9B parameters, GLM-OCR supports deployment via vLLM, SGLang, and Ollama, significantly reducing inference latency and compute cost, making it ideal for high-concurrency services and edge deployments.
- **Easy to Use:** Fully open-sourced and equipped with a comprehensive SDK and inference toolchain, offering simple installation, one-line invocation, and smooth integration into existing production pipelines.
| Model | Download Links | Precision |
|---|---|---|
| GLM-OCR | 🤗 Hugging Face 🤖 ModelScope | BF16 |
We provide an SDK for using GLM-OCR more efficiently and conveniently.
# Install from source
git clone https://github.com/zai-org/glm-ocr.git
cd glm-ocr
uv venv --python 3.12 --seed && source .venv/bin/activate
uv pip install -e .
Two ways to use GLM-OCR:
Use the hosted cloud API – no GPU needed. The cloud service runs the complete GLM-OCR pipeline internally, so the SDK simply forwards your request and returns the result.
config.yaml:

pipeline:
  maas:
    enabled: true          # Enable MaaS mode
    api_key: your-api-key  # Required
That's it! When maas.enabled=true, the SDK acts as a thin wrapper that forwards your request to the cloud service and returns the result.
Input note (MaaS): the upstream API accepts file as a URL or a data:<mime>;base64,... data URI. If you have raw base64 without the data: prefix, wrap it as a data URI (recommended); the SDK will auto-wrap local file paths, bytes, and raw base64 into a data URI when calling MaaS.
API documentation: https://docs.bigmodel.cn/cn/guide/models/vlm/glm-ocr
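For example, a minimal sketch of building the data URI yourself (whether parse() accepts a data-URI string directly, the way it accepts an http(s) URL, is an assumption here; the SDK will in any case auto-wrap paths, bytes, and raw base64 for you):

```python
import base64

from glmocr import parse

# Build a data URI from raw base64 (assumes a PNG; adjust the MIME type to your file).
with open("page.png", "rb") as f:
    raw_b64 = base64.b64encode(f.read()).decode("ascii")
data_uri = f"data:image/png;base64,{raw_b64}"

# Assumption: parse() accepts a data-URI string the same way it accepts an
# http(s) URL; otherwise pass the file path and let the SDK auto-wrap it.
result = parse(data_uri)
result.save(output_dir="./results")
```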
Deploy the GLM-OCR model locally for full control. The SDK provides the complete pipeline: layout detection, parallel region OCR, and result formatting.
Install vLLM:
uv pip install -U vllm --torch-backend=auto --extra-index-url https://wheels.vllm.ai/nightly
# Or use Docker
docker pull vllm/vllm-openai:nightly
Launch the service:
# In a Docker container, uv may not be needed for the transformers install
uv pip install git+https://github.com/huggingface/transformers.git
# Run with MTP for better performance
vllm serve zai-org/GLM-OCR --allowed-local-media-path / --port 8080 --speculative-config '{"method": "mtp", "num_speculative_tokens": 1}' --served-model-name glm-ocr
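Once the server is running, it exposes vLLM's OpenAI-compatible API. A rough smoke-test sketch (the prompt text and response handling here are assumptions; in normal use the SDK builds the request for you):

```python
import base64

import requests

# Send one page image to the vLLM server's OpenAI-compatible endpoint.
# "glm-ocr" matches --served-model-name above; the prompt is an assumption.
with open("page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

payload = {
    "model": "glm-ocr",
    "messages": [{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "OCR this page."},
        ],
    }],
    "max_tokens": 16384,
}
resp = requests.post("http://localhost:8080/v1/chat/completions",
                     json=payload, timeout=300)
print(resp.json()["choices"][0]["message"]["content"])
```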
Install SGLang:
docker pull lmsysorg/sglang:dev
# Or build from source
uv pip install git+https://github.com/sgl-project/sglang.git#subdirectory=python
Launch the service:
# In a Docker container, uv may not be needed for the transformers install
uv pip install git+https://github.com/huggingface/transformers.git
# Run with MTP for better performance
python -m sglang.launch_server --model zai-org/GLM-OCR --port 8080 --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --served-model-name glm-ocr
# Adjust the speculative config based on your device
After launching the service, configure config.yaml:
pipeline:
maas:
enabled: false # Disable MaaS mode (default)
ocr_api:
api_host: localhost # or your vLLM/SGLang server address
api_port: 8080
For specialized deployment scenarios, see the detailed deployment guides in the repository.
# Parse a single image
glmocr parse examples/source/code.png
# Parse a directory
glmocr parse examples/source/
# Set output directory
glmocr parse examples/source/code.png --output ./results/
# Use a custom config
glmocr parse examples/source/code.png --config my_config.yaml
# Enable debug logging with profiling
glmocr parse examples/source/code.png --log-level DEBUG
from glmocr import GlmOcr, parse
# Simple function
result = parse("image.png")
result = parse(["img1.png", "img2.jpg"])
result = parse("https://example.com/image.png")
result.save(output_dir="./results")
# Note: a list is treated as pages of a single document.
# Class-based API
with GlmOcr() as parser:
result = parser.parse("image.png")
print(result.json_result)
result.save()
# Start service
python -m glmocr.server
# With debug logging
python -m glmocr.server --log-level DEBUG
# Call API
curl -X POST http://localhost:5002/glmocr/parse \
-H "Content-Type: application/json" \
-d '{"images": ["./example/source/code.png"]}'
Semantics: images can be a string or a list.

Full configuration in glmocr/config.yaml:
# Server (for glmocr.server)
server:
host: "0.0.0.0"
port: 5002
debug: false
# Logging
logging:
level: INFO # DEBUG enables profiling
# Pipeline
pipeline:
# OCR API connection
ocr_api:
api_host: localhost
api_port: 8080
api_key: null # or set API_KEY env var
connect_timeout: 300
request_timeout: 300
# Page loader settings
page_loader:
max_tokens: 16384
temperature: 0.01
image_format: JPEG
min_pixels: 12544
max_pixels: 71372800
# Result formatting
result_formatter:
output_format: both # json, markdown, or both
# Layout detection (optional)
enable_layout: false
See config.yaml for all options.
Here are two examples of output formats:
[[{ "index": 0, "label": "text", "content": "...", "bbox_2d": null }]]
# Document Title
Body...
| Table | Content |
| ----- | ------- |
| ... | ... |
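A small sketch of consuming the JSON result (it assumes the nested structure shown above, where the outer list is pages and each block has index, label, content, and a possibly-null bbox_2d; the exact path depends on where you saved results):

```python
import json

# Walk the JSON result: outer list = pages, inner list = blocks per page.
with open("results/result.json") as f:
    pages = json.load(f)

for page_idx, blocks in enumerate(pages):
    for block in sorted(blocks, key=lambda b: b["index"]):
        snippet = block["content"][:80]
        print(f"page {page_idx} [{block['label']}]: {snippet}")
```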
You can run the example code like this:
python examples/example.py
Output structure (one folder per input):
- result.json – structured OCR result
- result.md – Markdown result
- imgs/ – cropped image regions (when layout mode is enabled)

GLM-OCR uses composable modules for easy customization:
| Component | Description |
|---|---|
| `PageLoader` | Preprocessing and image encoding |
| `OCRClient` | Calls the GLM-OCR model service |
| `PPDocLayoutDetector` | PP-DocLayout layout detection |
| `ResultFormatter` | Post-processing, outputs JSON/Markdown |
You can extend the behavior by creating custom pipelines:
from glmocr.dataloader import PageLoader
from glmocr.ocr_client import OCRClient
from glmocr.postprocess import ResultFormatter
class MyPipeline:
def __init__(self, config):
self.page_loader = PageLoader(config)
self.ocr_client = OCRClient(config)
self.formatter = ResultFormatter(config)
def process(self, request_data):
# Implement your own processing logic
pass
This project is inspired by the excellent work of related open-source projects and communities.
The code of this repo is released under the Apache License 2.0.
The GLM-OCR model is released under the MIT License.
The complete OCR pipeline integrates PP-DocLayoutV3 for document layout analysis, which is licensed under the Apache License 2.0. Users should comply with both licenses when using this project.
I am reminded it's basically impossible to read cursive writing in a language you don't know even if it's the same alphabet.