AI Summary • Published on May 20, 2025
Machine unlearning (MU) is vital for responsible AI, particularly for complying with data privacy regulations like GDPR's "right to be forgotten." Speech data is especially sensitive due to personally identifiable information, making MU crucial for user autonomy and trust in AI systems. While MU has been explored in other domains like text and image processing, its application to complex speech tasks, such as Spoken Language Understanding (SLU) in vocal assistants, remains underdeveloped. Existing research on MU for audio processing has been limited to simpler tasks like keyword spotting, highlighting a significant gap for SLU-specific challenges.
To address this gap, the authors introduce UnSLU-BENCH, the first comprehensive benchmark for machine unlearning in SLU. The benchmark covers four intent classification datasets spanning four languages (Fluent Speech Commands, SLURP, ITALIC, Speech-MASSIVE) and evaluates two transformer models per dataset (wav2vec 2.0 and HuBERT for English; XLS-R-128 and XLS-R-53 for the other languages). The study assesses eight distinct unlearning techniques: Fine-Tuning (FT), Negative Gradients (NG), NegGrad+ (NG+), catastrophic forgetting of the last k layers (CF-k), UNSIR, Bad Teaching (BT, BT-L), and SCRUB. Furthermore, the paper proposes a novel Global Unlearning Metric (GUM) that simultaneously considers the efficacy (how well the requested information is forgotten), efficiency (computational cost), and utility (the model's remaining performance on the original task) of unlearning methods, comparing them against an ideal "gold" model retrained without the forgotten data.
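To make concrete how a single score could fold these three axes together, the sketch below shows one hypothetical aggregation: each axis is assumed to be pre-normalized to [0, 1] relative to the "gold" retrained model, and a harmonic mean penalizes methods that are weak on any one axis. The paper's actual GUM formula is not reproduced in this summary, so the weighting here is illustrative only.

```python
def global_unlearning_score(efficacy: float, efficiency: float, utility: float) -> float:
    """Illustrative combination of the three axes a GUM-style metric considers.

    Assumptions (not the paper's exact definition):
    - each input is already normalized to [0, 1] against the gold retrained model,
    - a harmonic mean is used so that a method weak on any single axis
      (e.g., fast but barely forgetting) cannot score highly overall.
    """
    eps = 1e-8  # avoid division by zero when an axis is exactly 0
    axes = (efficacy, efficiency, utility)
    return len(axes) / sum(1.0 / (a + eps) for a in axes)


# Example: a very fast method that forgets well but loses some utility.
print(global_unlearning_score(efficacy=0.95, efficiency=0.99, utility=0.80))
```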
The benchmark results indicate that Negative Gradients (NG) consistently achieves the highest GUM scores across models and datasets, demonstrating exceptional efficiency (up to 1748x faster than retraining) and strong efficacy in removing speaker-specific data while maintaining utility. Other methods like Fine-Tuning balance utility and efficacy well for complex models but are less efficient. CF-k is efficient but can lead to incomplete unlearning, while Bad Teaching variants show dataset-dependent performance. SCRUB and UNSIR generally score poorly on GUM due to moderate speedups and inconsistent efficacy. The study also highlights that, under a fixed computing budget, the choice of learning rate shapes the trade-off between utility and efficacy. Additionally, the analysis of unlearning on SLURP* revealed that prolonged training causes models to overfit, making unlearning interventions less effective and suggesting that an ideal training duration exists that balances learning against memorization risk.
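For readers unfamiliar with Negative Gradients, the minimal PyTorch sketch below illustrates the core idea: performing gradient ascent (by negating the loss) on the forget set so the model is pushed away from its memorized predictions for the speakers to be removed. The model, optimizer, and batch layout are placeholders, not the benchmark's actual implementation.

```python
import torch
import torch.nn.functional as F

def negative_gradient_step(model, optimizer, forget_batch, device="cpu"):
    """One Negative Gradients (NG) update on a batch from the forget set.

    Assumes `model(inputs)` returns intent-classification logits and that
    `forget_batch` is an (inputs, labels) pair; both are illustrative.
    """
    model.train()
    inputs, labels = forget_batch
    inputs, labels = inputs.to(device), labels.to(device)

    optimizer.zero_grad()
    logits = model(inputs)
    # Standard cross-entropy, but negated: maximizing the loss on the
    # forget set drives the model away from the data to be unlearned.
    loss = -F.cross_entropy(logits, labels)
    loss.backward()
    optimizer.step()
    return -loss.item()  # report the (positive) forget-set loss for monitoring
```

In practice such updates are run only briefly and monitored on the retained data, since ascending the loss for too long degrades utility on the original task, which matches the utility/efficacy trade-off the benchmark measures.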
UnSLU-BENCH establishes a critical foundation for evaluating machine unlearning in Spoken Language Understanding, emphasizing the need for robust privacy-preserving techniques in voice-based AI systems. The introduction of the Global Unlearning Metric (GUM) provides a more holistic approach to assessing unlearning methods by integrating efficacy, efficiency, and utility, which is crucial for developing trustworthy AI. The findings underscore that while retraining from scratch is ideal for perfect unlearning, practical solutions like NG offer a balanced approach. This work paves the way for future research aimed at developing more effective and computationally feasible unlearning mechanisms that can uphold user privacy and comply with data protection regulations in the evolving landscape of conversational AI.