Zhipu AI Introduces GLM-OCR: A 0.9B Multimodal OCR Model for Document Parsing and Key Information Extraction (KIE)

By Naveed Ahmad | 15/03/2026 | 6 min read


Why does document OCR remain a hard engineering problem? What does it take to make OCR useful for real documents instead of clean demo images? And can a compact multimodal model handle parsing, tables, formulas, and structured extraction without turning inference into a resource bonfire?

That is the problem targeted by GLM-OCR, released by researchers from Zhipu AI and Tsinghua University. The research team presents GLM-OCR as a 0.9B-parameter compact multimodal model for document understanding. It combines a 0.4B CogViT visual encoder, a lightweight cross-modal connector, and a 0.5B GLM language decoder. The stated goal is to balance document recognition quality with lower latency and lower computational cost than larger multimodal systems.
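As a quick sanity check on the parameter budget, the reported component sizes sum to the headline 0.9B. One assumption here: the cross-modal connector's exact size is not given in the article, so it is treated as negligible.

```python
# Reported component sizes, in billions of parameters.
COMPONENTS_B = {
    "CogViT visual encoder": 0.4,
    "cross-modal connector": 0.0,  # "lightweight"; exact size unreported (assumption)
    "GLM language decoder": 0.5,
}

def total_params_b(components: dict) -> float:
    """Sum per-component parameter counts (in billions)."""
    return round(sum(components.values()), 1)
```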

Traditional OCR systems are often good at plain-text transcription, but they struggle when documents contain mixed layouts, tables, formulas, code blocks, seals, and structured fields. Recent multimodal large language models improve document understanding, but the research team argues that their size and standard autoregressive decoding make them expensive for edge deployment and large-scale production. GLM-OCR is positioned as a smaller system built for these deployment constraints, rather than as a general-purpose vision-language model adapted to OCR as an afterthought.

A Compact Architecture Built for OCR Workloads

A key technical point in this work is the use of Multi-Token Prediction (MTP). Standard autoregressive decoding predicts one token at a time, which is not ideal for OCR-style tasks where outputs are often deterministic and locally structured. GLM-OCR instead predicts multiple tokens per step. The model is trained to predict 10 tokens per step and generates 5.2 tokens per decoding step on average at inference time, yielding roughly a 50% throughput improvement. To keep memory overhead manageable, the implementation uses a parameter-sharing scheme across the draft models.
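A minimal sketch of the multi-token prediction idea, under stated assumptions: this is not Zhipu's implementation, and `propose` and `verify` are placeholder callables standing in for the draft heads and the verification pass. Each step drafts a block of tokens, an accepted prefix is kept, and the mean accepted length per step is the effective speedup over one-token-at-a-time decoding.

```python
from typing import Callable, List

def mtp_decode(
    propose: Callable[[List[int]], List[int]],       # drafts a block of tokens
    verify: Callable[[List[int], List[int]], int],   # returns accepted prefix length
    prompt: List[int],
    max_new_tokens: int,
) -> tuple:
    """Multi-token decoding loop sketch: each step proposes a block and
    keeps only the verified prefix. Returns (output tokens, mean tokens
    accepted per step) -- the latter is the effective speedup factor."""
    out = list(prompt)
    steps = accepted_total = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = propose(out)
        n = max(1, verify(out, draft))  # always advance at least one token
        out.extend(draft[:n])
        steps += 1
        accepted_total += n
    return out, accepted_total / steps
```

With a toy proposer that drafts 10 tokens and a verifier that accepts 5, the loop reports a mean of 5 tokens per step, mirroring the article's 10-drafted / 5.2-accepted figures in spirit.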

Two-Stage Layout Parsing Instead of Flat Page Reading

At the system level, GLM-OCR adopts a two-stage pipeline. The first stage uses PP-DocLayout-V3 for layout analysis, which detects structured regions on the page. The second stage performs parallel region-level recognition over these detected regions. This matters because the model is not simply reading an entire page left to right the way a generic vision-language model might. It first breaks the page into semantically meaningful regions, which improves efficiency and makes the system more robust on documents with complicated layouts.
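The two-stage flow can be sketched as follows. Note the hedge: `detect_layout` and `recognize_region` are placeholder callables standing in for PP-DocLayout-V3 and the GLM-OCR model; the point is only that regions are detected once and then recognized independently, so recognition parallelizes across regions.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_document(page_image, detect_layout, recognize_region):
    """Two-stage parsing sketch: (1) a layout detector returns typed
    regions (text, table, formula, ...); (2) each region is recognized
    independently, so the second stage can run in parallel."""
    regions = detect_layout(page_image)  # e.g. [{"type": ..., "bbox": ...}, ...]
    with ThreadPoolExecutor() as pool:
        contents = list(pool.map(lambda r: recognize_region(page_image, r), regions))
    # Reassemble in reading order (here: the detector's output order).
    return [{"type": r["type"], "content": c} for r, c in zip(regions, contents)]
```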

Document Parsing and KIE Use Different Output Paths

The architecture also separates two related document tasks. For document parsing, the pipeline uses layout detection and region processing to produce structured outputs such as Markdown and JSON. For Key Information Extraction (KIE), the research team describes a different path: the full document image is fed to the model with a task prompt, and the model directly generates JSON containing the extracted fields. That distinction matters because GLM-OCR is not presented as a single monolithic page-to-text model. It is a structured generation system with different operating modes depending on the task.
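The two output paths can be sketched as a simple request builder. The prompt wording and request shape here are illustrative assumptions, not the model's actual prompt format; the sketch only captures the split the article describes: region-level parsing versus full-page prompted JSON extraction.

```python
def build_request(task: str, page_image, regions=None, fields=None) -> dict:
    """Sketch of the two operating modes:
    - "parse": region-level recognition producing Markdown/JSON per region;
    - "kie":   full page image plus a task prompt naming the target fields,
               with the model expected to return one JSON object."""
    if task == "parse":
        return {"mode": "region", "regions": regions, "output": "markdown+json"}
    if task == "kie":
        prompt = "Extract the following fields as JSON: " + ", ".join(fields)
        return {"mode": "full_page", "image": page_image, "prompt": prompt}
    raise ValueError("unknown task: " + task)
```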

A 4-Stage Training Pipeline with Task-Specific Rewards

The training recipe is split into four stages. Stage 1 trains the vision encoder on image-text pairs and grounding or retrieval data. Stage 2.1 performs multimodal pretraining on image-text, document parsing, grounding, and VQA data. Stage 2.2 adds the MTP objective. Stage 3 is supervised fine-tuning on OCR-specific tasks, including text recognition, formula transcription, table structure recovery, and KIE. Stage 4 applies reinforcement learning using GRPO. The reward design is task-specific: Normalized Edit Distance for text recognition, CDM score for formula recognition, TEDS score for table recognition, and field-level F1 for KIE, together with structural penalties such as repetition penalties, malformed-structure penalties, and JSON validation constraints.
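To make the reward design concrete, here is a hedged sketch of two of the task-specific rewards. Assumptions: `SequenceMatcher.ratio` is used as a stand-in for true normalized edit distance, and the exact F1 and penalty formulas are simplified illustrations, not the paper's definitions.

```python
import json
from difflib import SequenceMatcher

def text_reward(pred: str, ref: str) -> float:
    """Text-recognition reward sketch: similarity in [0, 1]
    (a stand-in for 1 - normalized edit distance)."""
    return SequenceMatcher(None, pred, ref).ratio()

def kie_reward(pred_json: str, ref_fields: dict) -> float:
    """KIE reward sketch: field-level F1, with a hard zero for
    malformed JSON (echoing the JSON validation constraint)."""
    try:
        pred = json.loads(pred_json)
    except json.JSONDecodeError:
        return 0.0  # structural penalty: invalid JSON earns no reward
    tp = sum(1 for k, v in ref_fields.items() if pred.get(k) == v)
    prec = tp / len(pred) if pred else 0.0
    rec = tp / len(ref_fields) if ref_fields else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```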

Benchmark Results Show Strong Performance, With Important Caveats

On public benchmarks, GLM-OCR reports strong results across several document tasks. It scores 94.6 on OmniDocBench v1.5, 94.0 on OCRBench (Text), 96.5 on UniMERNet, 85.2 on PubTabNet, and 86.0 on TEDS_TEST. For KIE, it reports 93.7 on Nanonets-KIE and 86.1 on Handwritten-KIE. The research team notes that results for Gemini-3-Pro and GPT-5.2-2025-12-11 are shown only for reference and are excluded from the best-score ranking, which is an important detail when interpreting claims about model leadership.

    https://arxiv.org/pdf/2603.10910

The benchmark story is strong, but it needs careful phrasing. GLM-OCR achieves the highest reported scores among the evaluated non-reference models on OmniDocBench v1.5, OCRBench (Text), UniMERNet, and TEDS_TEST. On PubTabNet, however, it does not lead overall; MinerU 2.5 reports 88.4 versus GLM-OCR's 85.2. For KIE, GLM-OCR outperforms the open-source competitors listed in the paper's comparison table, but Gemini-3-Pro scores higher on both Nanonets-KIE and Handwritten-KIE in the reference column. So the results support a strong competitive claim, but not a blanket "best at everything" claim.

Deployment Details

The research team states that GLM-OCR supports vLLM, SGLang, and Ollama, and can be fine-tuned via LLaMA-Factory. They also report throughput of 0.67 images/s and 1.86 PDF pages/s under their evaluation setup. In addition, they describe a MaaS API priced at 0.2 RMB per million tokens, with example cost estimates for scanned images and simple-layout PDFs. These details suggest that GLM-OCR is being framed as both a research model and a deployable system.
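A back-of-envelope estimator for a batch job, using the reported figures (1.86 PDF pages/s throughput, 0.2 RMB per million tokens). The tokens-per-page default is an assumption for illustration; real token counts depend heavily on document density.

```python
def batch_estimate(n_pages: float, pages_per_s: float = 1.86,
                   tokens_per_page: float = 1000.0,
                   rmb_per_m_tokens: float = 0.2) -> dict:
    """Rough wall-clock and MaaS-cost estimate for an OCR batch job,
    from the reported throughput and per-token pricing."""
    seconds = n_pages / pages_per_s
    cost_rmb = n_pages * tokens_per_page / 1e6 * rmb_per_m_tokens
    return {"hours": seconds / 3600, "cost_rmb": cost_rmb}
```

Under these assumptions, a million PDF pages would take on the order of 150 hours on a single such endpoint and cost roughly 200 RMB in tokens, which illustrates why the article frames GLM-OCR as a deployable, large-scale system rather than a research demo.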

    Key Takeaways

• GLM-OCR is a compact 0.9B multimodal OCR model built from a 0.4B CogViT encoder and a 0.5B GLM decoder.
• It uses Multi-Token Prediction (MTP) to improve decoding efficiency, reaching 5.2 tokens per step on average and roughly 50% higher throughput.
• The model uses a two-stage pipeline: PP-DocLayout-V3 handles layout analysis, then GLM-OCR performs parallel region-level recognition.
• It supports both document parsing and KIE: parsing outputs Markdown/JSON, while KIE directly generates JSON from the full document image.
• Benchmark results are strong but not universal wins: GLM-OCR leads several reported non-reference benchmarks, but MinerU 2.5 is higher on PubTabNet, and Gemini-3-Pro is higher on the reference-only KIE scores.

Check out the Paper, Repo, and Model Page.



