Tencent Hunyuan has released HunyuanOCR, a 1B parameter vision language model that is specialized for OCR and document understanding. The model is built on Hunyuan’s native multimodal architecture and runs spotting, parsing, information extraction, visual question answering, and text image translation through a single end to end pipeline.
HunyuanOCR is a lightweight alternative to general VLMs such as Gemini 2.5 and Qwen3 VL that still matches or surpasses them on OCR centric tasks. It targets production use cases like document parsing, card and receipt extraction, video subtitle extraction, and multilingual document translation.
Architecture, Native Resolution ViT plus Lightweight LLM
HunyuanOCR uses 3 main modules: a Native Resolution Visual Encoder called Hunyuan ViT, an Adaptive MLP Connector, and a Lightweight Language Model. The encoder is based on SigLIP-v2-400M and is extended to support arbitrary input resolutions through adaptive patching that preserves the original aspect ratio. Images are split into patches according to their native proportions and processed with global attention, which improves recognition on long text lines, long documents, and low quality scans.
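As a rough illustration of aspect ratio preserving patching, the sketch below computes a patch grid directly from an image's native size. The patch size of 16 and the budget of 4096 patches are assumptions chosen for the example, and `patch_grid` is a hypothetical helper, not part of the HunyuanOCR code.

```python
import math


def patch_grid(width, height, patch=16, max_patches=4096):
    """Return (cols, rows) of a patch grid that follows the image's proportions.

    patch and max_patches are illustrative values, not HunyuanOCR settings.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    if cols * rows > max_patches:
        # Shrink both sides by the same factor so the proportions survive.
        scale = math.sqrt(max_patches / (cols * rows))
        cols = max(1, min(max_patches, math.floor(cols * scale)))
        rows = max(1, min(max_patches, math.floor(rows * scale)))
    return cols, rows


# A wide text line keeps its 4:1 shape instead of being squashed to a square.
print(patch_grid(1600, 400))  # (100, 25)
```

A fixed square resize would map the same 1600 by 400 strip to equal rows and columns, which is the distortion that native resolution encoding is meant to avoid.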
The Adaptive MLP Connector performs learnable pooling on the spatial dimension. It compresses the dense visual tokens into a shorter sequence, while keeping information from text dense regions. This reduces the sequence length passed to the language model and lowers compute, while preserving OCR relevant details.
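The effect of spatial pooling on sequence length can be sketched with a weighted merge over small windows of the token grid. This is a simplified stand-in: the article does not describe the connector's internal layers, so the 2 by 2 window and the fixed `weights` (which would be learned parameters in a real model) are assumptions.

```python
def pool_tokens(grid, weights, window=2):
    """Merge each window x window block of token vectors into one vector.

    grid: rows x cols nested list of equal-length feature vectors.
    weights: window * window mixing weights (placeholders for learned values).
    """
    rows, cols = len(grid), len(grid[0])
    if rows % window or cols % window:
        raise ValueError("grid must be divisible by the window size")
    if len(weights) != window * window:
        raise ValueError("need one weight per position in the window")
    dim = len(grid[0][0])
    pooled = []
    for r in range(0, rows, window):
        out_row = []
        for c in range(0, cols, window):
            # Collect the vectors in this window in row-major order.
            block = [grid[r + dr][c + dc]
                     for dr in range(window) for dc in range(window)]
            out_row.append([
                sum(w * vec[i] for w, vec in zip(weights, block))
                for i in range(dim)
            ])
        pooled.append(out_row)
    return pooled


grid = [[[1.0, 2.0] for _ in range(4)] for _ in range(4)]  # 16 tokens
pooled = pool_tokens(grid, weights=[0.25] * 4)             # 4 tokens
```

With a 2 by 2 window the language model sees one quarter of the visual tokens, which is where the compute saving comes from.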
The language model is based on the densely architected Hunyuan 0.5B model and uses XD RoPE. XD RoPE splits rotary position embeddings into 4 subspaces for text, height, width, and time. This gives the model a native way to align 1D token order with 2D layout and 3D spatiotemporal structure. As a result, the same stack can handle multi column pages, cross page flows, and sequences of video frames.
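One way to picture the four subspaces is to cut each feature vector into four slices and rotate every slice by its own position index. The sketch below does this with the standard rotary formula. The equal four way split, the base of 10000, and the function names are assumptions for illustration, since the article only states that there are subspaces for text, height, width, and time.

```python
import math

AXES = ("text", "height", "width", "time")


def rotate_pairs(vec, position, base=10000.0):
    """Apply a standard rotary embedding to one even-length vector."""
    out = list(vec)
    for i in range(len(vec) // 2):
        theta = position / (base ** (2 * i / len(vec)))
        x, y = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x * math.cos(theta) - y * math.sin(theta)
        out[2 * i + 1] = x * math.sin(theta) + y * math.cos(theta)
    return out


def xd_rope(vec, positions):
    """Rotate four equal slices of vec, one per axis in AXES."""
    if len(vec) % (2 * len(AXES)):
        raise ValueError("vector length must be a multiple of 8")
    size = len(vec) // len(AXES)
    out = []
    for k, axis in enumerate(AXES):
        out.extend(rotate_pairs(vec[k * size:(k + 1) * size], positions[axis]))
    return out


# A token in the third column of a page: only the width slice is rotated.
q = xd_rope([1.0] * 8, {"text": 0, "height": 0, "width": 3, "time": 0})
```

Because each axis has its own slice, two tokens that are far apart in reading order but adjacent in a column still look close along the width axis.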
Training and inference follow a fully end to end paradigm. There is no external layout analysis or post processing model in the loop. All tasks are expressed as natural language prompts and handled in a single forward pass. This design removes error propagation across pipeline stages and simplifies deployment.
Data and Pre Training Recipe
The data pipeline builds more than 200M image text pairs across 9 real world scenarios, including street views, documents, advertisements, handwritten text, screenshots, cards and certificates and invoices, game interfaces, video frames, and artistic typography. The corpus covers more than 130 languages.
Synthetic data comes from a multilingual generator that supports right to left scripts and paragraph level rendering. The pipeline controls font, language, rotation, and RGB values, and applies warping, blur, and local lighting changes to simulate mobile captures and other hard conditions.
Pre training follows 4 stages. Stage-1 performs vision language alignment with pure text, synthetic parsing and recognition data, and general caption data, using 50B tokens and 8k context. Stage-2 runs multimodal pre training on 300B tokens that mix pure text with synthetic spotting, parsing, translation, and VQA samples. Stage-3 extends context length to 32k with 80B tokens focused on long documents and long text. Stage-4 is application oriented supervised fine tuning on 24B tokens of human annotated and hard negative data, keeping 32k context and unified instruction templates.
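The four stage budget can be written down as a small table in code, which makes the relative sizes easy to compare. The `Stage` class and field names are hypothetical. The token counts and context lengths are the ones stated above, and the Stage-2 context is left as `None` because the article does not give it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    tokens_b: int              # training tokens, in billions
    context_k: Optional[int]   # context length in thousands, None if not stated


STAGES = [
    Stage("vision language alignment", 50, 8),
    Stage("multimodal pre training", 300, None),
    Stage("long context pre training", 80, 32),
    Stage("application oriented SFT", 24, 32),
]


def total_tokens_b(stages):
    """Sum the token budget over all stages, in billions."""
    return sum(s.tokens_b for s in stages)


print(total_tokens_b(STAGES))  # 454
```

Stage-2 accounts for roughly two thirds of the 454B token total, while the final supervised stage is the smallest at 24B.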
Reinforcement Learning with Verifiable Rewards
After supervised training, HunyuanOCR is further optimized with reinforcement learning. The research team uses Group Relative Policy Optimization (GRPO) and a Reinforcement Learning with Verifiable Rewards setup for structured tasks. For text spotting, the reward is based on intersection over union matching of boxes combined with normalized edit distance over text. For document parsing, the reward uses normalized edit distance between the generated structure and the reference.
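A spotting reward of this kind can be sketched by matching predicted boxes to reference boxes with IoU and then scoring the matched text with normalized edit distance. The greedy matching, the 0.5 IoU threshold, the axis aligned box format, and the averaging rule below are assumptions for the example, not the paper's exact formula.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def edit_distance(s, t):
    """Levenshtein distance computed one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    return prev[-1]


def text_similarity(s, t):
    """1 minus the edit distance normalized by the longer string."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest


def spotting_reward(preds, refs, iou_threshold=0.5):
    """preds and refs are lists of (box, text); returns a score in [0, 1]."""
    if not preds and not refs:
        return 1.0
    used, total = set(), 0.0
    for ref_box, ref_text in refs:
        best, best_iou = None, iou_threshold
        for k, (box, _) in enumerate(preds):
            score = iou(box, ref_box)
            if k not in used and score >= best_iou:
                best, best_iou = k, score
        if best is not None:
            used.add(best)
            total += text_similarity(preds[best][1], ref_text)
    # Dividing by the larger count penalizes both misses and extra boxes.
    return total / max(len(preds), len(refs))
```

Both components are computed from the reference annotation alone, which is what makes the reward verifiable without a judge model.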
For VQA and translation, the system uses an LLM as a judge. VQA uses a binary reward that checks semantic match. Translation uses a COMET style scoring LLM with scores in [0, 5], normalized to [0, 1]. The training framework enforces length limits and strict formats, and assigns zero reward when outputs overflow or break the schema, which stabilizes optimization and encourages valid JSON or structured outputs.
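The score normalization and the zero reward gate are simple enough to sketch directly. The function names and the 4096 character limit are hypothetical, since the article states only that length limits and strict formats are enforced, and JSON parsing stands in here for whatever schema check the training framework actually runs.

```python
import json


def normalize_judge_score(score, low=0.0, high=5.0):
    """Map a judge score from [low, high] onto [0, 1], clamping outliers."""
    clipped = min(max(score, low), high)
    return (clipped - low) / (high - low)


def gated_reward(output, raw_reward, max_chars=4096, require_json=False):
    """Return raw_reward only if the output respects length and format limits."""
    if len(output) > max_chars:
        return 0.0  # overflow: no credit regardless of content
    if require_json:
        try:
            json.loads(output)
        except json.JSONDecodeError:
            return 0.0  # broken structure: no credit
    return raw_reward


reward = gated_reward('{"total": "12.50"}', normalize_judge_score(4.0), require_json=True)
```

Gating before scoring means a malformed output can never earn partial credit, so the policy has no incentive to trade validity for content.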
Benchmark Results, a 1B Model Competing with Larger VLMs
On the internal text spotting benchmark of 900 images across 9 categories, HunyuanOCR reaches an overall score of 70.92. It outperforms traditional pipeline methods like PaddleOCR and BaiduOCR and also general VLMs such as Gemini 2.5 Pro, Qwen3 VL 2B, Qwen3 VL 235B, and Seed 1.6 Vision, despite using far fewer parameters.
On OmniDocBench, HunyuanOCR achieves 94.10 overall, with 94.73 on formulas and 91.81 on tables. On the Wild OmniDocBench variant, which prints and recaptures documents under folds and lighting changes, it scores 85.21 overall. On DocML, a multilingual parsing benchmark across 14 non Chinese and non English languages, it reaches 91.03, and the paper reports state of the art results across all 14 languages.
For information extraction and VQA, HunyuanOCR reaches 92.29 accuracy on cards, 92.53 on receipts, and 92.87 on video subtitles. On OCRBench, it scores 860, higher than DeepSeek OCR at similar scale and close to larger general VLMs like Qwen3 VL 2B Instruct and Gemini 2.5 Pro.
For text image translation, HunyuanOCR is evaluated on the DoTA benchmark and a DocML based internal set. It achieves a strong COMET score on DoTA for English to Chinese document translation, and the model won first place in Track 2.2 OCR free Small Model of the ICDAR 2025 DIMT competition.
Key Takeaways
- Compact end to end OCR VLM: HunyuanOCR is a 1B parameter OCR focused vision language model that connects a 0.4B native resolution ViT to a 0.5B Hunyuan language model through an MLP adapter, and runs spotting, parsing, information extraction, VQA and translation in a single end to end instruction driven pipeline without external layout or detection modules.
- Unified support for diverse OCR scenarios: The model is trained on more than 200M image text pairs across 9 scenarios, including documents, street views, advertisements, handwritten content, screenshots, cards and invoices, game interfaces and video frames, with coverage of over 130 languages in training and support for more than 100 languages in deployment.
- Data pipeline plus reinforcement learning: Training uses a 4 stage recipe (vision language alignment, multimodal pre training, long context pre training and application oriented supervised fine tuning), followed by reinforcement learning with group relative policy optimization and verifiable rewards for spotting, parsing, VQA and translation.
- Strong benchmark results for sub 3B models: HunyuanOCR reaches 94.1 on OmniDocBench for document understanding, and achieves 860 on OCRBench, which is reported as state of the art among vision language models with fewer than 3B parameters, while also outperforming several commercial OCR APIs and larger open models such as Qwen3 VL 4B on core OCR benchmarks.
Editorial Notes
HunyuanOCR is a strong signal that OCR specific VLMs are maturing into practical infrastructure, not just benchmarks. Tencent combines a 1B parameter end to end architecture with a Native Vision Transformer, an Adaptive MLP Connector and RL with verifiable rewards to deliver a single model that covers spotting, parsing, IE, VQA and translation across more than 100 languages, and it does so while reaching leading scores on OCRBench for sub 3B models and 94.1 on OmniDocBench. Overall, HunyuanOCR marks an important shift toward compact, instruction driven OCR engines that are realistic for production deployment.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, Asif is committed to harnessing the potential of Artificial Intelligence for social good. His most recent endeavor is the launch of an Artificial Intelligence Media Platform, Marktechpost, which stands out for its in-depth coverage of machine learning and deep learning news that is both technically sound and easily understandable by a wide audience. The platform boasts over 2 million monthly views, illustrating its popularity among audiences.
