AI Summary • Published on Jan 27, 2026
The paper identifies a fundamental principle, "Compression Tells Intelligence": a system's ability to compress information efficiently reflects its intelligence. This concept has driven two distinct areas of visual technology: classical visual coding, grounded in information theory and focused on pixel-level fidelity for multimedia, and emerging visual token technology, which uses generative AI to extract semantic information for multimodal large language models (MLLMs). Although both aim for efficient visual representation, the two fields have developed independently. This divergence creates a significant gap: classical codecs excel at data compression but are not inherently suited to AI models, while visual tokens serve AI tasks well but currently lack the theoretical grounding and compression rates of traditional methods. Bridging this divide is crucial for understanding the trade-off between compression efficiency and model performance, which in turn is vital for advancing visual intelligence.
The authors present a comprehensive review of both classical visual coding and visual token technology, outlining their histories, core principles, and techniques. They propose a theoretical framework that unifies the two domains by examining them through the lenses of information theory (Shannon vs. semantic entropy), functionality (statistical vs. context-aware redundancy reduction), optimization (rate-distortion vs. information bottleneck), and objectives (human vs. machine fidelity). Visual tokenization is reframed as an information bottleneck problem, optimized by balancing a compression loss (reconstruction error) against a task-preservation loss (cross-entropy); a sketch of this formulation follows below. The paper introduces a token efficiency ratio to quantify this balance, linking it to classical rate-distortion theory. It further details how classical coding principles, such as structural decorrelation, entropy-aware rate control, and vector quantization, can be adapted to improve visual tokenizers, and, conversely, how MLLMs can enhance traditional codecs through semantic-guided rate allocation, feature-domain compression, and universal probabilistic modeling. Concrete examples, such as Quadtree Partitioning-based Visual Token Pruning (QPID) and a codec tailored to multimodal LLMs (CoTAM), illustrate the practical value of this unified perspective.
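As a rough illustration of this framing, the three objectives can be written side by side. The notation here (X for the input image, T for its token sequence, Y for the downstream task target, with trade-off weights β and λ) is ours and may differ from the paper's:

```latex
% Minimal sketch, in our own notation, of the objectives being unified.
\begin{align}
  &\min_{p(t \mid x)} \; I(X;T) \;-\; \beta\, I(T;Y)
      && \text{(information bottleneck)} \\
  &\mathcal{L} \;=\; \underbrace{\mathcal{L}_{\mathrm{rec}}(x,\hat{x})}_{\text{compression}}
      \;+\; \lambda\, \underbrace{\mathcal{L}_{\mathrm{CE}}(y,\hat{y})}_{\text{task preservation}}
      && \text{(training surrogate)} \\
  &\min_{p(\hat{x} \mid x)} \; I(X;\hat{X})
      \quad \text{s.t.}\quad \mathbb{E}\!\left[d(X,\hat{X})\right] \le D
      && \text{(classical rate--distortion)}
\end{align}
```

On this reading, the token efficiency ratio can be understood as a rate-distortion operating point: accuracy retained per fraction of tokens kept, in place of distortion per bit.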
The paper showcases the potential of visual token technology in practical applications such as MLLMs, AI-generated content (AIGC), and embodied AI. In experimental validation, the Quadtree Partitioning-based Visual Token Pruning (QPID) method maintains reasoning accuracy even at sharply reduced token budgets: it preserves 96.82% of full-token average accuracy with only 25% of tokens and 90.22% with a mere 6.25%, outperforming existing methods across various benchmarks. Ablation studies confirm that both entropy-driven selection and adaptive quadtree allocation are critical for this stable accuracy at small token counts. In another case study, CoTAM, a codec designed for MLLMs, achieves up to 36% bitrate savings while maintaining comparable MLLM performance across six benchmarks, highlighting the benefits of machine-perception-optimized compression. More broadly, studies surveyed in the paper indicate that preserving just 25–35% of visual tokens can retain over 95% of reasoning accuracy in semantic understanding tasks.
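To make the quadtree mechanism concrete, here is a minimal Python sketch of entropy-driven, budget-constrained quadtree allocation. It is our own illustration, not the paper's QPID implementation: the names (`patch_entropy`, `quadtree_prune`), the histogram-entropy criterion, and the greedy split rule are all assumptions.

```python
import numpy as np

def patch_entropy(patch, bins=16):
    """Shannon entropy (bits) of a patch's intensity histogram -- our stand-in
    for the paper's entropy-driven selection signal."""
    hist, _ = np.histogram(patch, bins=bins, range=(0.0, 1.0))
    p = hist / max(hist.sum(), 1)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def quadtree_prune(image, budget, min_size=4):
    """Greedily split the highest-entropy region until `budget` leaves remain.
    Each leaf region would then be pooled into one retained token, so detailed
    areas receive many fine tokens and flat areas a few coarse ones."""
    h, w = image.shape
    leaves = [(0, 0, h, w)]                # (row, col, height, width)
    while len(leaves) + 3 <= budget:       # each split: -1 leaf, +4 leaves
        splittable = [r for r in leaves
                      if r[2] >= 2 * min_size and r[3] >= 2 * min_size]
        if not splittable:
            break
        y, x, rh, rw = max(
            splittable,
            key=lambda r: patch_entropy(image[r[0]:r[0]+r[2], r[1]:r[1]+r[3]]))
        leaves.remove((y, x, rh, rw))
        hh, hw = rh // 2, rw // 2
        leaves += [(y, x, hh, hw), (y, x + hw, hh, rw - hw),
                   (y + hh, x, rh - hh, hw), (y + hh, x + hw, rh - hh, rw - hw)]
    return leaves

# Example: a 32x32 image tiled into 4x4 patches yields 64 full-resolution
# tokens; a budget of 16 leaf regions corresponds to keeping ~25% of them.
img = np.random.rand(32, 32).astype(np.float32)
regions = quadtree_prune(img, budget=16)
print(len(regions), "token regions retained")
```

The sketch is meant to mirror the ablation's finding: the entropy signal decides *where* to spend the token budget, while the quadtree decides *at what granularity*, and stable accuracy at small budgets depends on both.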
By unifying visual coding and visual token technology, this work provides bidirectional insights: established coding principles can significantly enhance modern token systems, and token-based semantic modeling can drive next-generation codecs optimized for machine tasks. The paper forecasts future advances, including unified tokenizers that intrinsically balance semantic alignment with high-fidelity reconstruction, and standardized, efficient token communication technologies, akin to traditional codecs such as H.264/H.265, applicable across a wide range of intelligent tasks. It also underscores the evolving role of tokenization in AI-generated content, for creating compact, semantically rich representations, and in embodied AI, for efficient perception, context compression, and real-time control through machine-native interfaces. The framework is poised for extension to emerging modalities such as 3D and 4D information, paving the way toward truly unified multimodal intelligence.