AI Summary • Published on Dec 2, 2025
Large Vision Transformers (ViTs), despite their impressive performance, demand vast computational resources and extensive training data, making them inaccessible for many applications. This research addresses the critical need to understand the performance of ViTs at a smaller scale, specifically investigating how different pre-training and fine-tuning strategies impact a tiny 5-million-parameter Vision Transformer (ViNy) on downstream tasks like semantic segmentation.
The researchers designed experiments using ViNy, a tiny Vision Transformer with approximately 5 million parameters. For self-supervised pre-training, they employed SimMIM (masked image modeling) on ImageNet-1K, varying the pre-training data size from 0 to 200,000 examples. To explore intermediate fine-tuning, they used the Intel Image Classification dataset. The final downstream task was semantic segmentation on the Oxford-IIIT Pet Dataset, where the model was fine-tuned with varying data sizes (250 to 6,000 examples). Model performance was assessed using both accuracy and mean Intersection over Union (mIoU).
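The mean Intersection over Union metric used for the segmentation evaluation can be sketched as follows. This is a minimal NumPy implementation, not the authors' exact code; in particular, how classes absent from an image are averaged is an assumption here.

```python
import numpy as np

def mean_iou(pred: np.ndarray, target: np.ndarray, num_classes: int) -> float:
    """Mean Intersection over Union over classes.

    pred, target: integer arrays of per-pixel class labels, same shape.
    Classes absent from both prediction and target are skipped
    (one common convention; the paper's choice is not specified).
    """
    ious = []
    for c in range(num_classes):
        pred_c = pred == c
        target_c = target == c
        union = np.logical_or(pred_c, target_c).sum()
        if union == 0:
            continue  # class appears in neither map; skip it
        intersection = np.logical_and(pred_c, target_c).sum()
        ious.append(intersection / union)
    return float(np.mean(ious))

# Toy 2x2 label maps with three classes:
pred = np.array([[0, 1], [1, 2]])
target = np.array([[0, 1], [2, 2]])
print(mean_iou(pred, target, num_classes=3))  # 0.666... (IoUs: 1.0, 0.5, 0.5)
```

Per-class IoU is intersection over union of the predicted and ground-truth masks; the mean across classes penalizes models that do well only on dominant classes, which is why it is reported alongside pixel accuracy.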
The study found that increasing the amount of both pre-training and fine-tuning data consistently improved test accuracy and mIoU on the downstream semantic segmentation task. Pre-training offered substantial benefits, particularly when downstream fine-tuning data was scarce; for instance, a model with 200,000 pre-training examples and 250 fine-tuning examples achieved a 13% higher mIoU than a baseline without pre-training. However, both pre-training and fine-tuning exhibited diminishing returns as data size increased. Crucially, intermediate fine-tuning consistently degraded the model’s performance across all tested pre-training and fine-tuning configurations, sometimes even negating the positive effects of pre-training.
For small-scale models in resource-constrained environments, the findings suggest prioritizing ample downstream fine-tuning data and adding pre-training selectively. Intermediate fine-tuning warrants particular caution: a task whose mechanics are not tightly aligned with the final objective can significantly harm performance, since unrelated intermediate tasks may misdirect the model's learned representations. In low-compute regimes, thoughtful data selection and task alignment are more effective than simply stacking additional training stages.