AI Summary • Published on Dec 2, 2025
Large Language Models (LLMs) excel primarily in high-resource languages like English and Chinese, leaving low-resource languages such as Tibetan significantly underserved. The scarcity of large-scale corpora and standardized NLP benchmarks for Tibetan poses a major challenge for developing inclusive language technologies. While existing multilingual models offer cross-lingual transfer, they often struggle with extremely low-resource languages due to limited token coverage, orthographic variation, and complex morphology. Furthermore, there has been little systematic research on adapting modern LLMs to Tibetan, or on how model parameters evolve internally during such adaptation.
To address these challenges, the researchers proposed a two-stage adaptation pipeline for Qwen2.5-3B to Tibetan. The first stage, Continual Pretraining (CPT), focused on establishing Tibetan linguistic grounding by exposing the model to a large-scale corpus of 200,000 Tibetan-only texts from sources like CUTE and tibetan-mix. This stage used a standard left-to-right causal language modeling objective for one epoch. The second stage, Supervised Fine-Tuning (SFT), specialized the model for specific tasks and instruction following. The SFT corpus comprised 50,000 examples with an 80/20 mixture of Tibetan-focused tasks (Tibetan instruction following, Chinese-to-Tibetan, and English-to-Tibetan translation) and Chinese general instructions. Training was conducted using the LLaMA-Factory framework on NVIDIA H20 GPUs, utilizing BF16 mixed precision and DeepSpeed ZeRO-2 for efficiency. Different learning rates and epoch counts were used for each stage to optimize for linguistic adaptation and task alignment, respectively.
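For concreteness, the sketch below outlines the two stages with plain Hugging Face transformers rather than the LLaMA-Factory setup used in the paper. The dataset file names, learning rates, SFT epoch count, and batch sizes are illustrative assumptions, and the DeepSpeed ZeRO-2 configuration is omitted.

```python
# Minimal two-stage sketch (CPT then SFT) using Hugging Face transformers.
# File paths and hyperparameters are assumptions, not the paper's exact values.
import torch
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL = "Qwen/Qwen2.5-3B"
tokenizer = AutoTokenizer.from_pretrained(MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)

def tokenize(batch):
    # Standard left-to-right causal LM objective: labels mirror the inputs,
    # which DataCollatorForLanguageModeling(mlm=False) sets up automatically.
    return tokenizer(batch["text"], truncation=True, max_length=2048)

collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Stage 1: Continual Pretraining (CPT) on Tibetan-only text, one epoch.
cpt_data = load_dataset("json", data_files="tibetan_cpt.jsonl", split="train")  # assumed file
cpt_data = cpt_data.map(tokenize, batched=True, remove_columns=cpt_data.column_names)
cpt_args = TrainingArguments(
    output_dir="qwen2.5-3b-tibetan-cpt",
    num_train_epochs=1,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=2e-5,          # assumed; the paper uses a stage-specific LR
    bf16=True,
    logging_steps=50,
    save_strategy="epoch",
)
Trainer(model=model, args=cpt_args, train_dataset=cpt_data, data_collator=collator).train()

# Stage 2: Supervised Fine-Tuning (SFT) on the 80/20 instruction mixture.
# Each example is assumed pre-formatted as a single prompt+response string; in
# practice a chat template is applied and prompt tokens are masked from the loss.
sft_data = load_dataset("json", data_files="tibetan_sft.jsonl", split="train")  # assumed file
sft_data = sft_data.map(tokenize, batched=True, remove_columns=sft_data.column_names)
sft_args = TrainingArguments(
    output_dir="qwen2.5-3b-tibetan-sft",
    num_train_epochs=3,          # assumed; the paper uses a stage-specific epoch count
    per_device_train_batch_size=2,
    gradient_accumulation_steps=16,
    learning_rate=1e-5,          # assumed
    bf16=True,
    logging_steps=50,
    save_strategy="epoch",
)
Trainer(model=model, args=sft_args, train_dataset=sft_data, data_collator=collator).train()
```

Because the same `model` object is passed to both trainers, the SFT stage continues from the CPT-adapted weights, mirroring the sequential pipeline described above.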
The empirical evaluation demonstrated significant performance gains. Perplexity decreased consistently from 2.98 (base model) through CPT (1.61) to SFT (1.54), indicating improved Tibetan language modeling. Chinese→Tibetan translation quality improved substantially, with BLEU increasing from 0.046 to 0.261 (a 5.7-fold increase) and chrF rising from 2.2 to 6.6 (a threefold improvement). English→Tibetan translation also improved significantly, reaching a BLEU score of 0.186 and a chrF of 5.4. A layer-wise analysis across 435 layers of Qwen3-4B revealed that adaptation was concentrated primarily in the embedding matrices and output heads, with secondary updates in mid-to-late MLP gate projections. Importantly, the correlation between CPT and SFT weight changes was near unity (r≈1.0), suggesting that supervised fine-tuning consolidates and sharpens the linguistic manifold established during continual pretraining rather than overwriting it.
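A minimal sketch of how such a layer-wise comparison can be reproduced is shown below. The local checkpoint paths are hypothetical, and the use of per-tensor Frobenius norms with a Pearson correlation is an assumption about the paper's exact change metric; loading all three checkpoints in float32 also requires substantial CPU memory.

```python
# Sketch of the layer-wise weight-change analysis: compute a per-tensor change
# magnitude for CPT (base -> CPT) and SFT (CPT -> SFT), then correlate the two
# profiles across all parameter tensors.
import torch
from scipy.stats import pearsonr
from transformers import AutoModelForCausalLM

def state_dict_cpu(path):
    model = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.float32)
    return {k: v.detach().cpu() for k, v in model.state_dict().items()}

base = state_dict_cpu("Qwen/Qwen2.5-3B")           # base checkpoint
cpt  = state_dict_cpu("./qwen2.5-3b-tibetan-cpt")  # hypothetical CPT checkpoint
sft  = state_dict_cpu("./qwen2.5-3b-tibetan-sft")  # hypothetical SFT checkpoint

names, cpt_delta, sft_delta = [], [], []
for name in base:
    if base[name].dtype.is_floating_point:
        names.append(name)
        cpt_delta.append((cpt[name] - base[name]).norm().item())  # ||theta_CPT - theta_base||_F
        sft_delta.append((sft[name] - cpt[name]).norm().item())   # ||theta_SFT - theta_CPT||_F

# Tensors with the largest CPT updates (expected: embeddings, lm_head, late MLP gates).
for name, delta in sorted(zip(names, cpt_delta), key=lambda x: -x[1])[:10]:
    print(f"{name:60s} {delta:.4f}")

# Correlation between the CPT and SFT change profiles across parameter tensors.
r, p = pearsonr(cpt_delta, sft_delta)
print(f"Pearson r between CPT and SFT weight changes: {r:.3f} (p={p:.1e})")
```

Under this reading, an r close to 1 means the tensors that move most during CPT are also the ones that move most during SFT, which is consistent with the consolidation interpretation given above.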
This study provides a practical and reproducible two-stage framework for effectively adapting large language models to low-resource languages, exemplified by Tibetan. The findings have significant implications for bridging the gap between multilingual foundation models and underserved linguistic communities, promoting more equitable access to advanced LLM capabilities. The detailed layer-wise analysis sheds new light on the internal mechanisms of LLM adaptation, indicating that continual pretraining performs the heavy lifting of re-anchoring embeddings and output heads, while supervised fine-tuning consolidates these linguistic representations for task-specific alignment. This consolidation mechanism helps mitigate catastrophic forgetting and contributes to consistent performance improvements across stages.