AI Summary • Published on Oct 20, 2025
Large Language Models (LLMs) face substantial computational challenges when processing long texts, because the cost of self-attention scales quadratically with sequence length. This limits their ability to handle long-context scenarios efficiently. The research proposes that the visual modality could offer an effective medium for compressing textual information, since a single image can represent rich document content using significantly fewer tokens than the equivalent digital text. Current Vision-Language Models (VLMs), however, struggle to simultaneously process high-resolution inputs, keep activation memory low, and minimize the number of vision tokens, all of which are crucial for effective compression.
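As a rough illustration of why fewer tokens matter, the sketch below compares the quadratic self-attention cost of a page represented as raw text tokens versus a compressed vision-token representation. The token counts are illustrative assumptions (100 vision tokens happens to match the Small mode described below), not measurements.

```python
# Illustrative arithmetic only: self-attention cost grows quadratically with
# sequence length, so representing a page with fewer vision tokens shrinks
# the attention cost super-linearly. Token counts below are assumptions.

def attention_cost(num_tokens: int) -> int:
    """Relative cost of one self-attention pass (proportional to n^2)."""
    return num_tokens ** 2

text_tokens = 1000      # assumed raw text tokens for one document page
vision_tokens = 100     # a 10x-compressed visual representation of the same page

compression_ratio = text_tokens / vision_tokens
cost_reduction = attention_cost(text_tokens) / attention_cost(vision_tokens)

print(f"compression ratio: {compression_ratio:.0f}x")       # 10x fewer tokens
print(f"attention cost reduction: {cost_reduction:.0f}x")   # 100x cheaper attention
```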
The authors introduce DeepSeek-OCR, a Vision-Language Model (VLM) designed as a proof of concept for efficient vision-text compression. DeepSeek-OCR consists of two main components: DeepEncoder and a DeepSeek3B-MoE decoder. DeepEncoder is a novel architecture engineered to process high-resolution inputs while maintaining low activation memory and producing a manageable number of vision tokens. It achieves this by serially connecting a window-attention encoder and a global-attention encoder, with a 16x convolutional compressor between them. This design lets the window-attention stage handle the large initial number of tokens, which the compressor then reduces sixteenfold before they enter the dense global-attention stage, keeping both memory and token count under control. DeepEncoder supports multiple input resolutions (Tiny: 512x512, 64 tokens; Small: 640x640, 100 tokens; Base: 1024x1024, 256 tokens; Large: 1280x1280, 400 tokens; plus dynamic modes such as Gundam for ultra-high-resolution images) by dynamically interpolating positional encodings. The decoder is a DeepSeek3B-MoE architecture that activates approximately 570 million parameters during inference, providing the expressive power of a 3B model with the inference efficiency of a roughly 500M-parameter model, which makes it well suited for domain-centric VLM research such as OCR. DeepSeek-OCR is trained on a diverse dataset comprising OCR 1.0 data (traditional OCR), OCR 2.0 data (parsing tasks for complex artificial images such as charts, chemical formulas, and plane geometry), general vision data, and a portion of text-only data to preserve language capabilities. Training follows a two-stage pipeline: DeepEncoder is first trained independently, and DeepSeek-OCR is then trained end-to-end.
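To make the token arithmetic concrete, here is a shape-level sketch (not the released implementation) of how a patch-then-compress pipeline yields the per-mode vision-token counts above. The 16-pixel patch size is an assumption; only the 16x reduction and the per-mode token counts come from the summary.

```python
# Shape-level sketch of the DeepEncoder token flow described above.
# PATCH is an assumed ViT patch size; COMPRESSION is the 16x convolutional
# token compressor that sits between the two attention stages.

PATCH = 16
COMPRESSION = 16

def deepencoder_token_counts(height: int, width: int) -> dict:
    # Stage 1: the window-attention encoder sees one token per image patch.
    patch_tokens = (height // PATCH) * (width // PATCH)
    # Stage 2: the convolutional compressor reduces tokens 16x before the
    # dense global-attention encoder, keeping activation memory in check.
    vision_tokens = patch_tokens // COMPRESSION
    return {"patch_tokens": patch_tokens, "vision_tokens": vision_tokens}

for name, side in [("Tiny", 512), ("Small", 640), ("Base", 1024), ("Large", 1280)]:
    print(name, deepencoder_token_counts(side, side))
    # e.g. Base: 4096 patch tokens -> 256 vision tokens, matching the mode table
```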
DeepSeek-OCR demonstrates compelling performance in vision-text compression and OCR tasks. On the Fox benchmarks, the model achieves over 96% OCR decoding precision at a text compression ratio of 9-10x, and accuracy remains around 60% even at a 20x compression ratio. These results indicate the significant potential of optical context compression. Furthermore, DeepSeek-OCR achieves state-of-the-art performance on OmniDocBench while using far fewer vision tokens than existing models: it surpasses GOT-OCR2.0 (which uses 256 tokens per page) with only 100 vision tokens, and outperforms MinerU2.0 (which requires nearly 7,000 vision tokens) while using fewer than 800 tokens in Gundam mode. The model also exhibits "deep parsing" capabilities, enabling it to parse charts, geometric figures, chemical formulas, and natural images within documents using a unified prompt. DeepSeek-OCR supports nearly 100 languages for PDF documents and offers high practical value for large-scale data generation, producing over 200,000 pages of pretraining data per day on a single A100-40G GPU, or about 33 million pages per day across 20 nodes with 8 A100-40G GPUs each.
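A quick back-of-envelope check of the quoted throughput figures, treating the per-GPU rate as a lower bound since it is reported as "over 200,000" pages per day:

```python
# Sanity check of the data-generation throughput quoted above.

pages_per_gpu_per_day = 200_000      # single A100-40G, lower bound from the summary
nodes, gpus_per_node = 20, 8

cluster_pages_per_day = pages_per_gpu_per_day * nodes * gpus_per_node
print(f"{cluster_pages_per_day:,}+ pages/day")   # 32,000,000+, consistent with ~33M/day
```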
The findings from DeepSeek-OCR underscore the feasibility and promise of contexts optical compression, that is, compressing long textual contexts through the visual modality, as a novel approach to the long-context challenges faced by Large Language Models. The demonstrated 7-20x reduction in text tokens via the visual modality opens new avenues for improving computational efficiency in large-scale text processing and agent systems. This research provides empirical guidelines for vision-language model token allocation and showcases a practical, deployable architecture in DeepEncoder. Beyond OCR, the paradigm suggests a potential path toward theoretically unlimited context architectures by drawing a parallel between human memory decay and visual perception degradation: a "forgetting mechanism" in which recent information is kept at high fidelity through higher resolution, while older contexts are progressively downsized and consume fewer resources. Although this is an initial exploration, the work highlights a promising new direction for combining the vision and language modalities to advance LLM and VLM development, particularly for historical long-context compression and memory mechanisms.
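As a purely conceptual sketch of such a forgetting mechanism, older context could be re-rendered in progressively cheaper resolution modes. The decay schedule below (dropping one mode every 10 turns) is an assumption for illustration, not something specified in the paper; only the mode table comes from the summary above.

```python
# Conceptual illustration: map context "age" to a rendering mode so that
# older text consumes fewer vision tokens. The decay schedule is assumed.

# (mode name, square render resolution, vision tokens) from the mode table above
MODES = [("Large", 1280, 400), ("Base", 1024, 256),
         ("Small", 640, 100), ("Tiny", 512, 64)]

def mode_for_age(age_in_turns: int) -> tuple:
    """Pick a progressively cheaper rendering mode as context gets older."""
    index = min(age_in_turns // 10, len(MODES) - 1)   # assumed: step down every 10 turns
    return MODES[index]

for age in (0, 10, 25, 100):
    name, res, tokens = mode_for_age(age)
    print(f"age={age:>3} turns -> {name} mode, {res}x{res}, {tokens} vision tokens")
```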