AI Summary • Published on Apr 20, 2026
Human communication is inherently full-duplex, meaning participants can speak, listen, and interrupt each other fluidly. Replicating this in AI systems is challenging due to the limitations of traditional cascaded speech processing pipelines. These pipelines, composed of separate modules for tasks like voice activity detection (VAD), speaker recognition (SR), and automatic speech recognition (ASR), suffer from accumulated latency, information loss, and error propagation. For instance, signal processing front-ends often degrade ASR performance by distorting weak speech, and disjoint optimization prevents leveraging cross-task dependencies, leading to issues like false interruptions. While recent end-to-end audio Large Language Models (LLMs) like GPT-4o have advanced speech understanding and generation, most remain half-duplex and still rely on external, task-specific front-end components. This reliance reintroduces the problems of cascading architectures, particularly concerning robustness and responsiveness in real-world scenarios, making it difficult to achieve seamless, timely interactions.
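The cascade's two failure modes described above, additive latency and multiplicative error propagation, can be made concrete with a toy calculation. The stage names, latencies, and accuracies below are illustrative assumptions, not figures from the paper:

```python
# Toy illustration (illustrative numbers, not from the paper) of why cascaded
# front-ends accumulate latency and propagate errors across disjoint modules.

STAGE_LATENCY_MS = {"VAD": 30, "SR": 80, "ASR": 250}  # assumed per-stage delays

def cascaded_latency(stages):
    """Modules run sequentially, so per-stage delays simply add up."""
    return sum(STAGE_LATENCY_MS[s] for s in stages)

def cascaded_accuracy(per_stage_acc):
    """An upstream error corrupts all downstream input, so end-to-end
    accuracy is at best the product of per-stage accuracies."""
    acc = 1.0
    for a in per_stage_acc:
        acc *= a
    return acc
```

With these assumed numbers, three 95-to-98%-accurate stages already drop end-to-end accuracy below 84%, which is the disjoint-optimization penalty a unified model avoids.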
The Unified Audio Front-end LLM (UAF) proposes a novel approach by reformulating diverse audio front-end tasks (VAD, turn-taking detection (TD), SR, ASR, and question answering (QA)) into a single auto-regressive sequence prediction problem. The model takes streaming fixed-duration audio chunks (e.g., 600 ms) as input, along with a reference audio prompt to anchor the target speaker. It then auto-regressively generates discrete tokens that encode both semantic content (ASR text and model response) and system-level state controls (e.g., interruption signals). UAF employs an "Encoder-Projector-LLM" architecture, adapting the Qwen3-Omni-30B-A3B-Instruct model. The audio encoder converts raw speech into high-dimensional acoustic features, which are then mapped to the LLM's semantic embedding space by an audio projector. The LLM's vocabulary is augmented with special state tokens for VAD (SIL, TALK) and turn-taking detection (Complete, InComplete, Interrupt, Backchannel), alongside semantic tokens for ASR results and QA responses. The model uses dedicated heads for VAD and turn-taking, initialized from the LM head, and is fine-tuned efficiently using LoRA. A multi-stage training strategy is employed, starting with VAD/SR/ASR pre-training (6,000 hours), followed by TD and QA alignment (1,000 hours), and finally all-task joint fine-tuning on multi-turn user-agent dialogues. A hybrid data pipeline combining real and large-scale synthetic dialogues is used to generate robust, multi-talker far-field training data, complete with natural pauses, environmental noise, competing talkers, and echo injection. An acoustic-aware timestamp extraction pipeline ensures high-precision word-level timestamps for label alignment.
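The chunked decoding protocol above can be sketched as a simple streaming loop. Everything here is a hypothetical stand-in: the token spellings, the `decode_chunk` rule (energy thresholding in place of the actual LLM), and the barge-in handling are assumptions used only to show the shape of the interface, where each fixed-duration chunk yields state tokens plus incremental semantic tokens:

```python
# Hypothetical sketch of UAF-style streaming decoding (all names are assumptions).
# Each fixed-duration audio chunk produces a VAD state token, a turn-taking
# token, and any incremental ASR text; a toy rule replaces the real model.

from dataclasses import dataclass

# Special state tokens per the summary; exact surface forms are assumptions.
VAD_TOKENS = {"<SIL>", "<TALK>"}
TD_TOKENS = {"<Complete>", "<InComplete>", "<Interrupt>", "<Backchannel>"}

@dataclass
class ChunkOutput:
    vad: str   # "<SIL>" or "<TALK>"
    td: str    # turn-taking decision for this chunk
    text: str  # incremental ASR text (empty when silent)

def decode_chunk(chunk_energy: float, partial_text: str) -> ChunkOutput:
    """Toy stand-in for the LLM: an energy threshold instead of inference."""
    if chunk_energy < 0.1:  # treat low-energy chunks as silence
        return ChunkOutput("<SIL>", "<Complete>", "")
    return ChunkOutput("<TALK>", "<InComplete>", partial_text)

def run_stream(chunks):
    """Consume (energy, text) chunks; raise a barge-in event on user speech."""
    transcript, events = [], []
    for energy, text in chunks:
        out = decode_chunk(energy, text)
        if out.vad == "<TALK>":
            events.append("barge-in")  # system-level interruption signal
            transcript.append(out.text)
    return " ".join(transcript), events
```

The key design point this mirrors is that interaction control (the state tokens) and recognition output (the text) come from one decoding pass, so no separate VAD or endpointing module sits in front of the model.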
Extensive experiments demonstrated that UAF achieved leading performance across multiple audio front-end tasks and significantly enhanced real-world interaction quality. For VAD, UAF achieved the highest F1-score (97.57%) and a superior recall (97.99%) compared to baselines like TEN-VAD and Silero-VAD, indicating better sensitivity to true speech segments crucial for interruptions. In standard ASR tasks on public Mandarin datasets, UAF showed competitive Word Error Rate (WER) results, outperforming models like Kimi-Audio on AISHELL-2 (2.43 WER) and Qwen3-Omni on a challenging online test set (13.75 WER). Critically, for speaker-aware ASR in noisy multi-talker conditions with a reference audio prompt, UAF dramatically outperformed all baselines, achieving a 5.34 WER at 2 dB SNR versus 38.6 WER for Qwen3-Omni-30B-A3B, roughly a 7x relative improvement. For turn-taking detection, UAF achieved state-of-the-art accuracy across all categories on the Easy-Turn test set, including 100.0% on 'Interrupt' and 95.7% on 'Backchannel', significantly surpassing Qwen3-Omni-30B-A3B. Ablation studies confirmed that larger model sizes (30B-A3B) significantly improved robustness in low-SNR conditions, and LoRA fine-tuning achieved nearly identical performance to full parameter fine-tuning with reduced overhead. The use of dedicated task heads for VAD and TD, rather than a shared LM head, was crucial for maintaining the desired interaction protocol and achieving state-of-the-art turn detection.
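Two of the headline numbers can be sanity-checked with back-of-envelope arithmetic. The relative-improvement ratio follows directly from the two reported WERs, and the VAD precision is a derived quantity implied by the reported F1 and recall (it is not stated in the summary itself):

```python
# Back-of-envelope checks on two reported results (derived, not reported values).

# Speaker-aware ASR at 2 dB SNR: UAF vs. the Qwen3-Omni-30B-A3B baseline.
uaf_wer, baseline_wer = 5.34, 38.6
ratio = baseline_wer / uaf_wer  # ~7.2, consistent with the "7x" claim

# VAD precision implied by the reported F1 and recall, from F1 = 2PR / (P + R).
f1, recall = 97.57, 97.99
precision = f1 * recall / (2 * recall - f1)  # ~97.15%
```

Since the implied precision (~97.15%) sits just below the recall (97.99%), the reported F1 is internally consistent with a recall-leaning operating point, which matches the summary's emphasis on sensitivity to true speech segments.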
UAF represents a significant paradigm shift by challenging the traditional modular, cascaded front-end processing in full-duplex speech systems. By unifying VAD, SR, ASR, TD, and QA into a single end-to-end generative framework, UAF enables the joint modeling of semantic content and interaction-level control signals, fostering a more integrated and human-like conversational experience. The model's ability to leverage a reference audio prompt for target speaker anchoring allows for robust operation in complex, noisy, multi-talker environments with system playback, addressing a critical challenge in real-world applications. This work effectively bridges the gap between low-level signal perception and high-level language reasoning, transforming "listening" from a mere preprocessing step into an intelligent, context-aware capability directly embedded within the language model. This integrated architecture paves the way for future research towards more unified perception-generation systems for embodied and interactive AI, leading to more responsive and natural spoken dialogue agents.