Baidu Qianfan Team Releases Qianfan-OCR: A 4B-Parameter Unified Document Intelligence Model

By Naveed Ahmad | 19/03/2026 | 4 Mins Read
The Baidu Qianfan Team released Qianfan-OCR, a 4B-parameter end-to-end model designed to unify document parsing, layout analysis, and document understanding within a single vision-language architecture. Unlike traditional multi-stage OCR pipelines that chain separate modules for layout detection and text recognition, Qianfan-OCR performs direct image-to-Markdown conversion and supports prompt-driven tasks such as table extraction and document question answering.

Source: https://arxiv.org/pdf/2603.13398
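Because parsing, table extraction, and document QA are all prompt-driven against one checkpoint, usage reduces to swapping the text prompt. Below is a minimal sketch assuming a Hugging Face-style release; the model ID, prompt wording, and the AutoModelForVision2Seq loading pattern are illustrative assumptions, not confirmed details from the paper.

```python
# Hypothetical usage sketch -- the model ID, prompt format, and the
# assumption that the release follows the AutoModelForVision2Seq pattern
# are illustrative, not confirmed by the paper.
from PIL import Image
from transformers import AutoProcessor, AutoModelForVision2Seq

model_id = "baidu/Qianfan-OCR"  # placeholder ID; check the official release
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForVision2Seq.from_pretrained(model_id, device_map="auto")

image = Image.open("report_page.png")

# The same weights handle parsing, table extraction, and document QA;
# only the prompt changes.
prompt = "Convert this document to Markdown."

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=4096)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```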

Architecture and Technical Specifications

Qianfan-OCR uses the multimodal bridging architecture from the Qianfan-VL framework. The system consists of three main components (a minimal structural sketch follows the list):

• Vision Encoder (Qianfan-ViT): Employs an Any-Resolution design that tiles images into 448 x 448 patches. It supports variable-resolution inputs up to 4K, producing up to 4,096 visual tokens per image to preserve spatial resolution for small fonts and dense text.
    • Cross-Modal Adapter: A lightweight two-layer MLP with GELU activation that projects visual features into the language model's embedding space.
    • Language Model Backbone (Qwen3-4B): A 4.0B-parameter model with 36 layers and a native 32K context window. It uses Grouped-Query Attention (GQA) to reduce KV-cache memory usage by 4x.
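As a rough illustration of how the pieces connect, here is a minimal sketch of the adapter stage. The feature widths are assumptions (2560 is Qwen3-4B's hidden size per the public Qwen3 release; the ViT width is guessed), not published Qianfan-OCR hyperparameters.

```python
import torch
import torch.nn as nn

VISION_DIM = 1024  # assumed Qianfan-ViT feature width
LM_DIM = 2560      # Qwen3-4B hidden size per the public Qwen3 release

class CrossModalAdapter(nn.Module):
    """Lightweight two-layer MLP with GELU that projects visual features
    into the language model's embedding space, as described above."""
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, vision_tokens: torch.Tensor) -> torch.Tensor:
        # vision_tokens: (batch, num_patches, vision_dim); a 4K page tiled
        # into 448x448 patches can produce up to 4,096 such tokens.
        return self.proj(vision_tokens)

adapter = CrossModalAdapter(VISION_DIM, LM_DIM)
visual_tokens = torch.randn(1, 4096, VISION_DIM)
lm_inputs = adapter(visual_tokens)  # (1, 4096, 2560), concatenated with text embeddings
```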

    ‘Format-as-Thought’ Mechanism

The model's headline feature is Format-as-Thought, an optional thinking phase triggered by special tokens. During this phase, the model generates structured layout representations, including bounding boxes, element types, and reading order, before producing the final output (an illustrative transcript follows the list below).

• Practical Utility: This process recovers explicit layout-analysis capabilities (element localization and type classification) that are typically lost in end-to-end paradigms.
    • Performance Characteristics: Evaluation on OmniDocBench v1.5 indicates that enabling the thinking phase gives a consistent advantage on documents with high "layout label entropy": those containing heterogeneous elements such as mixed text, formulas, and diagrams.
    • Efficiency: Bounding-box coordinates are represented as dedicated special tokens, reducing thinking-output length by roughly 50% compared with plain digit sequences.
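To make the mechanism concrete, the snippet below shows one illustrative shape for a Format-as-Thought response. The tag names, fields, and plain-digit coordinates are readability assumptions; as the efficiency note above states, the real model encodes coordinates as dedicated special tokens.

```python
# Illustrative Format-as-Thought transcript. Tag names and digit
# coordinates are assumptions for readability; the actual model uses
# dedicated coordinate tokens, which is what halves the thinking length.
response = """\
<think>
box=(40, 32, 980, 110)   type=title      reading_order=1
box=(40, 140, 500, 900)  type=paragraph  reading_order=2
box=(520, 140, 980, 620) type=table      reading_order=3
</think>
# Quarterly Report

Revenue grew steadily through Q3 ...

| Region | Revenue |
|--------|---------|
| APAC   | 1.2B    |
"""
print(response)
```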

Empirical Performance and Benchmarks

Qianfan-OCR was evaluated against both specialized OCR systems and general vision-language models (VLMs).

Document Parsing and General OCR

The model ranks first among end-to-end models on several key benchmarks:

• OmniDocBench v1.5: Achieved a score of 93.12, surpassing DeepSeek-OCR-v2 (91.09) and Gemini-3 Pro (90.33).
    • OlmOCR Bench: Scored 79.8, leading the end-to-end class.
    • OCRBench: Achieved a score of 880, ranking first among all tested models.

On public KIE benchmarks, Qianfan-OCR achieved the highest average score (87.9), outperforming significantly larger models:

| Model | Overall Mean (KIE) | OCRBench KIE | Nanonets KIE (F1) |
    |---|---|---|---|
    | Qianfan-OCR (4B) | 87.9 | 95.0 | 86.5 |
    | Qwen3-4B-VL | 83.5 | 89.0 | 83.3 |
    | Qwen3-VL-235B-A22B | 84.2 | 94.0 | 83.8 |
    | Gemini-3.1-Pro | 79.2 | 96.0 | 76.1 |

Document Understanding

Comparative testing revealed that two-stage OCR+LLM pipelines often fail on tasks requiring spatial reasoning. For instance, all tested two-stage systems scored 0.0 on CharXiv benchmarks, because the text-extraction stage discards the visual context (axis relationships, data-point positions) necessary for chart interpretation.


    Deployment and Inference

Inference efficiency was measured in pages per second (PPS) on a single NVIDIA A100 GPU.

• Quantization: With W8A8 (AWQ) quantization, Qianfan-OCR achieved 1.024 PPS, a 2x speedup over the W16A16 baseline with negligible accuracy loss.
    • Architecture Advantage: Unlike pipeline systems that rely on CPU-based layout analysis, which can become a bottleneck, Qianfan-OCR is GPU-centric. This avoids inter-stage processing delays and enables efficient large-batch inference, as the sketch after this list illustrates.
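Below is a hedged serving sketch with vLLM, which can load AWQ-quantized checkpoints. The model ID and quantization flag are assumptions, and the multimodal image-input plumbing needed for real pages is omitted for brevity.

```python
# Hypothetical serving sketch; the model ID and quantization support are
# assumptions -- consult the official release for published formats.
from vllm import LLM, SamplingParams

llm = LLM(
    model="baidu/Qianfan-OCR",  # placeholder ID
    quantization="awq",         # quantized variant, if one is published
    max_model_len=32768,        # matches the native 32K context window
)

params = SamplingParams(temperature=0.0, max_tokens=4096)

# GPU-centric, large-batch generation: no CPU layout stage sits between
# requests, so batches keep the accelerator saturated.
outputs = llm.generate(["Convert this document to Markdown."], params)
print(outputs[0].outputs[0].text)
```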
