**Revolutionizing Document OCR: DeepSeek AI’s Breakthrough Technology**
Remember the last time you had to spend hours sifting through a stack of papers, deciphering handwritten notes, and wrestling with formatting nightmares? Those days are numbered! Introducing DeepSeek-OCR 2, the latest innovation from DeepSeek AI that’s about to change the game for document processing.
**How Did We Get Here?**
DeepSeek-OCR 2 is an open-source document OCR and understanding system that redefines how documents are processed. The team takes a fundamentally different approach by leveraging DeepEncoder-V2, a language-model-style transformer that converts a two-dimensional page into a one-dimensional sequence of visual tokens, already aligned with a realized reading order.
This is a significant departure from the traditional method of flattening an image into a raster sequence and applying a transformer with static positional encodings, as most multimodal models do. That approach is a poor match for documents with multi-column layouts, nested tables, and mixed-language regions. Humans, by contrast, read in a semantic order that jumps between regions of the page, and DeepSeek-OCR 2 is designed to do the same!
**The Vision Tokenizer: The Unsung Hero**
The vision tokenizer is inherited from DeepSeek-OCR: an 80M-parameter SAM-base backbone followed by two convolutional layers. Together they downsample the image, reducing the visual token count by a factor of 16 and compressing features into an embedding dimension of 896.
But here’s the kicker: DeepSeek-OCR 2 uses a multi-crop strategy to cover dense pages without letting the token count explode. A global view at 1024 × 1024 resolution produces 256 tokens, and up to 6 local crops at 768 × 768 resolution add 144 tokens each, keeping the visual token count between 256 and 1120 per page.
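The token budget above is simple arithmetic, and it can be sketched in a few lines. The 16× token reduction, the tile sizes, and the crop count come from the article; the 16-pixel patch size is the standard SAM setting and is an assumption here, not something the article states.

```python
# Sketch of the multi-crop visual token budget (illustrative, not DeepSeek's code).

def visual_tokens(side_px: int, patch_px: int = 16, downsample: int = 16) -> int:
    """Tokens for a square view: the patch grid is reduced 16x by the convolutions."""
    patches = (side_px // patch_px) ** 2
    return patches // downsample

GLOBAL_TOKENS = visual_tokens(1024)  # global 1024x1024 view -> 256 tokens
LOCAL_TOKENS = visual_tokens(768)    # each 768x768 local crop -> 144 tokens

def page_token_budget(num_local_crops: int) -> int:
    """Total visual tokens for a page with 0 to 6 local crops."""
    assert 0 <= num_local_crops <= 6
    return GLOBAL_TOKENS + num_local_crops * LOCAL_TOKENS

print(page_token_budget(0))  # 256  (sparse page: global view only)
print(page_token_budget(6))  # 1120 (dense page: global view + 6 local crops)
```

This reproduces the 256–1120 range quoted above: a sparse page costs only the global view, while a dense page adds up to six crops at 144 tokens each.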
**DeepEncoder-V2: The Game-Changer**
DeepEncoder-V2 is built by instantiating a Qwen2-0.5B transformer as the vision encoder. Its input sequence is formed by placing the visual tokens from the tokenizer as the prefix and a set of learnable “causal flow” tokens as the suffix.
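To make the prefix/suffix arrangement concrete, here is a minimal NumPy sketch. The 896-dimensional embedding and the 1120-token dense-page budget come from the article; the number of causal flow tokens (256 here) and the helper names are assumptions for illustration, not DeepSeek’s actual code.

```python
import numpy as np

EMBED_DIM = 896          # embedding dimension from the vision tokenizer (per the article)
NUM_FLOW_TOKENS = 256    # assumed count; the paper defines the real value

# The "causal flow" tokens are learnable parameters; random init stands in here.
causal_flow_tokens = np.random.randn(NUM_FLOW_TOKENS, EMBED_DIM).astype(np.float32)

def build_encoder_input(visual_tokens: np.ndarray) -> np.ndarray:
    """Prefix = visual tokens from the tokenizer; suffix = causal flow tokens."""
    assert visual_tokens.shape[1] == EMBED_DIM
    return np.concatenate([visual_tokens, causal_flow_tokens], axis=0)

page = np.random.randn(1120, EMBED_DIM).astype(np.float32)  # dense page: 1120 visual tokens
seq = build_encoder_input(page)
print(seq.shape)  # (1376, 896)
```

The idea is that the transformer attends from the suffix back over the visual prefix, so the flow-token positions end up carrying the page content in reading order.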
**The Training Pipeline**
The training data pipeline follows DeepSeek-OCR and focuses on OCR-intensive content, with OCR data accounting for 80% of the mix. Sampling is rebalanced across text, formulas, and tables at a 3:1:1 ratio, ensuring the model sees enough structure-heavy examples.
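Those two numbers pin down the sampling weights. A small sketch of the implied mix, assuming the 3:1:1 rebalance applies within the 80% OCR share and that the remaining 20% is non-OCR data (the article doesn’t break that remainder down):

```python
# Implied sampling weights: OCR is 80% of the mix, split 3:1:1
# across text, formulas, and tables; the rest is non-OCR data.

OCR_SHARE = 0.80
RATIO = {"text": 3, "formulas": 1, "tables": 1}

total = sum(RATIO.values())
weights = {k: OCR_SHARE * v / total for k, v in RATIO.items()}
weights["other"] = 1.0 - OCR_SHARE

print({k: round(v, 2) for k, v in weights.items()})
# {'text': 0.48, 'formulas': 0.16, 'tables': 0.16, 'other': 0.2}
```

So roughly half of all training samples are plain OCR text, with formulas and tables each seen about one sixth of the time.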
**Benchmarking Results on OmniDocBench**
The primary analysis uses OmniDocBench-v1.5, a benchmark of 1355 pages across 9 document classes in Chinese and English. The results are impressive: DeepSeek-OCR 2 achieves an overall OmniDocBench score of 91.09, a gain of 3.73 points over the original DeepSeek-OCR baseline, and does it with a slightly lower token budget.
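For readers who want the baseline number spelled out, the two figures above imply it directly:

```python
# The reported score and gain imply the original DeepSeek-OCR baseline.
score, gain = 91.09, 3.73
baseline = round(score - gain, 2)
print(baseline)  # 87.36
```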
**Key Takeaways**
* DeepSeek-OCR 2 replaces a CLIP ViT encoder with DeepEncoder-V2, a Qwen2-0.5B-based language-model encoder that converts a 2D document page into a 1D sequence of causal flow tokens aligned with a realized reading order.
* The vision tokenizer uses an 80M-parameter SAM-base backbone with convolutional layers, multi-crop global and local views, and keeps the visual token count between 256 and 1120 per page.
* Training follows a 3-stage pipeline: encoder pretraining, joint question enhancement with DeepSeek-3B-A500M, and decoder-only fine-tuning with the encoder frozen.
Want to learn more? Check out the [paper](https://github.com/deepseek-ai/DeepSeek_OCR2_paper.pdf), [repo](https://github.com/deepseek-ai/DeepSeek-OCR-2), and [model weights](https://huggingface.co/deepseek-ai/DeepSeek-OCR-2). And, as always, follow us on Twitter, join our 100k+ ML SubReddit, and subscribe to our Newsletter!
