AI Summary • Published on Apr 27, 2026
In large-scale industrial software systems, identifying the precise location of a fault from bug reports is a time-consuming and costly challenge. Developers often rely primarily on textual bug reports, lacking access to runtime information, execution traces, or detailed code context, especially during maintenance phases. This reliance on incomplete information makes traditional debugging methods inefficient and highlights a significant bottleneck in software quality assurance, particularly when systems are continuously evolving and accumulating defects. Prior research often assumes access to auxiliary artifacts or focuses on open-source systems, which rarely reflects real-world industrial constraints or confidentiality requirements.
This study framed fault localization as a supervised text classification problem, aiming to predict a ranked list of likely fault locations (at the subfolder/component level) using only the natural language content of bug reports. The approach was evaluated using proprietary data from ABB Robotics, comprising five years of resolved bug reports linked to their verified code fixes. Three traditional machine learning models (Logistic Regression, Support Vector Machine, and Random Forest) were compared against two fine-tuned transformer-based language models (RoBERTa-Base and Distil-RoBERTa). Traditional models utilized TF-IDF features and sentence embeddings, while transformer models processed tokenized reports directly. Systematic text preprocessing, class-imbalance-aware data augmentation, and ranking-centric metrics (Top-k Accuracy, Recall@k, MAP, MRR) were employed for evaluation, focusing on actionable recommendations for industrial developers.
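The best-performing setup described above (TF-IDF features feeding a Logistic Regression classifier that ranks candidate components) can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the bug-report texts and component labels below are hypothetical stand-ins for the proprietary ABB Robotics data, and the hyperparameters are arbitrary.

```python
# Sketch: bug-report text -> TF-IDF -> Logistic Regression -> ranked
# list of candidate fault locations (components).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: (bug report text, faulty component label).
# These examples are invented for illustration only.
reports = [
    "robot arm stops moving after path planning command",
    "path planner crashes on circular trajectory",
    "controller firmware reboots during teach pendant use",
    "teach pendant screen freezes when jogging the arm",
]
labels = ["motion", "motion", "controller", "controller"]

pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(reports, labels)

def rank_components(report, k=5):
    """Return up to k candidate components, most likely first."""
    proba = pipeline.predict_proba([report])[0]
    order = proba.argsort()[::-1][:k]
    return [pipeline.classes_[i] for i in order]

print(rank_components("arm freezes while planning a path"))
```

In a deployment like the one the study envisions, the ranked list would be surfaced at triage time so a developer starts from the few most likely components rather than the whole codebase.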
The evaluation demonstrated that traditional machine learning models using TF-IDF features consistently outperformed the fine-tuned transformer-based language models on the proprietary industrial dataset, especially on non-augmented data. Specifically, Logistic Regression and SVM with TF-IDF showed strong performance, and data augmentation significantly improved Random Forest's results. Distil-RoBERTa marginally outperformed RoBERTa-Base, but both consistently trailed the strongest TF-IDF baselines. The best configurations achieved a Top-1 accuracy of approximately 0.53, a Top-5 accuracy of up to 0.86, a MAP between 0.61 and 0.62, and an MRR around 0.66. These findings challenge the assumption that transformer models universally outperform classical approaches in industrial contexts with domain-specific, limited, and imbalanced data.
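The ranking-centric metrics reported above can be computed as sketched below. This is a simplified illustration assuming each bug report has exactly one true fault location; in that single-relevant-item case MAP reduces to MRR, so only Top-k accuracy and MRR are shown. The predicted rankings and ground-truth labels are toy data, not the study's.

```python
# Hedged sketch of Top-k accuracy and MRR for ranked fault-location
# predictions, assuming one true location per bug report.

def top_k_accuracy(ranked_lists, truths, k):
    """Fraction of reports whose true component appears in the top k."""
    hits = sum(t in r[:k] for r, t in zip(ranked_lists, truths))
    return hits / len(truths)

def mean_reciprocal_rank(ranked_lists, truths):
    """Mean of 1/rank of the true component (contributes 0 if absent)."""
    total = 0.0
    for ranked, truth in zip(ranked_lists, truths):
        if truth in ranked:
            total += 1.0 / (ranked.index(truth) + 1)
    return total / len(truths)

# Toy example: predicted component rankings for three reports vs truth.
preds = [
    ["motion", "controller", "io"],
    ["io", "motion", "controller"],
    ["controller", "io", "motion"],
]
truth = ["motion", "controller", "io"]

print(top_k_accuracy(preds, truth, 1))   # 1/3: only report 1 hits at rank 1
print(top_k_accuracy(preds, truth, 2))   # 2/3: reports 1 and 3 hit by rank 2
print(mean_reciprocal_rank(preds, truth))  # (1 + 1/3 + 1/2) / 3 = 11/18
```

A Top-5 accuracy of 0.86 as reported in the study means that for 86% of reports the true component was somewhere in the model's top five suggestions, which is the sense in which the approach narrows a developer's search space.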
The study demonstrates that historical bug reports can be systematically leveraged for text-based, AI-assisted fault localization, offering a scalable and low-cost complement to traditional debugging. The strong performance of lightweight, interpretable traditional models (LR, SVM, RF with augmentation) suggests they are highly effective and deployable in industrial settings, particularly when facing data constraints and confidentiality requirements. This provides actionable guidance for industries to make evidence-based technology choices, prioritizing on-premise solutions that integrate into existing workflows without needing source code access or external APIs. While the study's scope was a single system, the results highlight the immediate feasibility of improving triage decisions by narrowing the search space for developers. Future work includes replicating the study on additional systems, exploring hybrid models with other triage-time signals, and conducting human-in-the-loop studies to quantify effort savings.