AI Summary • Published on Jan 13, 2026
Existing open-source Multimodal Large Language Model (MLLM) frameworks are predominantly vision-centric, offering limited or no in-depth support for non-visual modalities such as speech, audio, and music. This gap forces researchers in auditory domains to retrofit vision-based systems, which creates inefficiencies, fragments development workflows, and raises a significant barrier to advancing audio-language models.
SLAM-LLM is introduced as a modular, open-source framework specifically designed for training customized MLLMs focused on speech, audio, and music processing. Its core architecture follows a clean encoder–projector–LLM design, with each component selected and customized through YAML configuration files. The framework supports a wide range of pretrained encoders (e.g., Whisper, HuBERT, BEATs, MERT), various projection modules (MLP, CNN, Q-Former), and diverse LLM backbones (e.g., LLaMA, Vicuna, Qwen), and it integrates parameter-efficient fine-tuning (PEFT) strategies such as LoRA and prefix-tuning. SLAM-LLM unifies all auditory tasks into a single auto-regressive generation process and provides a comprehensive suite of training and inference recipes, along with high-performance checkpoints for tasks such as Automatic Speech Recognition (ASR), Speech-to-Text Translation (S2TT), Speech Emotion Captioning (SEC), Automated Audio Captioning (AAC), and Music Captioning (MC).
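To make the encoder–projector–LLM decomposition concrete, the following PyTorch sketch wires the three pieces together. It is an illustration rather than the actual SLAM-LLM code: the class names (`LinearProjector`, `AudioLLM`), the frozen-encoder assumption, and the Hugging Face-style `inputs_embeds` keyword are assumptions made for the example; in the framework itself, the encoder, projector, and LLM are chosen through YAML configuration files.

```python
import torch
import torch.nn as nn


class LinearProjector(nn.Module):
    """MLP projector: maps encoder frame features into the LLM embedding space."""

    def __init__(self, encoder_dim: int, llm_dim: int, hidden_dim: int = 2048):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(encoder_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, llm_dim),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats)


class AudioLLM(nn.Module):
    """Encoder -> projector -> LLM: projected audio features act as soft prompt
    tokens prepended to the text-prompt embeddings, after which the LLM
    generates the answer auto-regressively."""

    def __init__(self, encoder: nn.Module, projector: nn.Module, llm: nn.Module):
        super().__init__()
        self.encoder = encoder      # e.g. a frozen speech/audio/music encoder
        self.projector = projector  # the trainable bridge in this sketch
        self.llm = llm              # assumed: a causal LM accepting `inputs_embeds`

    def forward(self, audio_input: torch.Tensor, prompt_embeds: torch.Tensor):
        with torch.no_grad():                        # encoder kept frozen in this sketch
            enc_out = self.encoder(audio_input)      # (B, T, encoder_dim)
        audio_embeds = self.projector(enc_out)       # (B, T, llm_dim)
        inputs = torch.cat([audio_embeds, prompt_embeds], dim=1)
        return self.llm(inputs_embeds=inputs)        # next-token prediction over text
```

Under this decomposition, swapping the projector for a CNN or Q-Former variant, or wrapping the LLM with LoRA, only changes which modules are passed into `AudioLLM`, which is the kind of plug-and-play composition the framework's YAML-driven design provides.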
SLAM-LLM achieves competitive or state-of-the-art performance across multiple benchmarks. For ASR, experiments show that larger, fine-tuned self-supervised encoders (such as HuBERT X-Large) consistently yield better results, especially when paired with chat-tuned LLMs (e.g., Vicuna-7B), often outperforming traditional neural-network-based (non-LLM) systems. The framework also demonstrates strong capabilities in low-resource and contextual ASR. In Automated Audio Captioning (AAC), LLM-based models built with SLAM-LLM achieve state-of-the-art results on datasets like Clotho and AudioCaps, with the improvements attributed to fine-tuned encoders, pre-training, PEFT, Retrieval-Augmented Generation (RAG), and projection decoding. For Music Captioning (MC), models trained with SLAM-LLM on comparatively small datasets still achieve results on par with existing models, highlighting the effectiveness of strong encoders and the potential for single-token representations to convey sufficient information to the LLM decoder (illustrated in the sketch below).
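The single-token idea can be illustrated with a minimal pooling projector. This is a hypothetical sketch: mean pooling followed by a linear map is a choice made for the example, not necessarily how the paper's models form their single-token representation.

```python
import torch
import torch.nn as nn


class SingleTokenProjector(nn.Module):
    """Collapses a clip's encoder features into a single LLM-space embedding,
    i.e. one soft token summarizing the whole input."""

    def __init__(self, encoder_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(encoder_dim, llm_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, encoder_dim) -> (B, 1, encoder_dim) via temporal mean pooling
        pooled = feats.mean(dim=1, keepdim=True)
        return self.proj(pooled)     # (B, 1, llm_dim): the single token fed to the LLM
```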
SLAM-LLM bridges a critical gap in the current MLLM ecosystem by providing a flexible, unified, open-source framework tailored to speech, audio, and music. By lowering the entry barrier for developing LLM-based audio systems and encouraging community collaboration, it is expected to accelerate research and innovation in audio-language modeling. The framework's extensive experimental results and the empirical insights drawn from them also offer practical guidance for future research and development in multimodal AI and its audio-related applications.