AI Summary • Published on Jan 1, 2026
The proliferation of Machine Learning (ML) models in registries like Hugging Face presents a significant challenge for Software Engineering (SE) researchers and practitioners. Despite the vast number of available models, it is difficult to identify those relevant to specific SE tasks. The absence of an SE-focused catalogue impedes the effective integration of Artificial Intelligence (AI) into the Software Development Lifecycle (SDLC), forcing users to search manually through extensive collections. This manual process is inefficient and error-prone, and it is further compounded by missing attributes, inconsistently reported performance, and potential privacy or ethical concerns, all of which slow the adoption of ML in SE workflows.
To address this gap, the authors developed SEMODS, a comprehensive dataset of SE models systematically collected, processed, and validated from Hugging Face. The methodology involved several key steps. First, a taxonomy of 147 SE tasks spanning the five stages of the SDLC was derived, building on established literature. Second, data for each model repository was collected via the Hugging Face API, encompassing model card descriptions, associated metadata (e.g., license, tags, libraries), and abstracts from linked arXiv papers. Third, the collected text underwent normalization, including tokenization, lowercasing, and lemmatization, to support accurate detection of SE tasks. Finally, a rigorous two-phase validation process ensured the SE relevance of the models: manual annotation of a statistically valid sample by a team of human experts, combined with assistance from a Large Language Model (LLM), specifically Gemini 2.0 Flash. The LLM was fine-tuned and tested until it reached almost perfect agreement with the human annotators (Cohen's kappa > 0.8), and only then deployed to validate the full dataset. The dataset schema links models to SE tasks, standardizes benchmark results, and captures tags and commit history, and an automated pipeline runs daily to add new models and refresh dynamic attributes.
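As a concrete illustration of the collection step, the sketch below shows how model cards, metadata, and linked arXiv identifiers can be pulled through the huggingface_hub client. This is a minimal approximation under assumed conventions, not the authors' actual pipeline, and the record fields are illustrative.

```python
# Sketch of metadata collection from Hugging Face (illustrative, not the
# paper's pipeline). Requires: pip install huggingface_hub
from huggingface_hub import HfApi, ModelCard

api = HfApi()

records = []
for info in api.list_models(limit=100):          # a full run would enumerate all repos
    try:
        card_text = ModelCard.load(info.id).text  # README / model card body
    except Exception:
        card_text = ""                            # many repos lack a card
    # arXiv links appear as tags of the form "arxiv:<id>"
    arxiv_ids = [t.split(":", 1)[1] for t in (info.tags or []) if t.startswith("arxiv:")]
    records.append({
        "model_id": info.id,
        "tags": info.tags,
        "card_text": card_text,
        "arxiv_ids": arxiv_ids,                   # abstracts would be fetched separately
    })
```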
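The normalization and task-detection step might look like the following sketch, using NLTK for tokenization and lemmatization and a simple keyword match against the taxonomy. The two taxonomy entries are invented placeholders, since the paper's 147-task taxonomy is not reproduced here.

```python
# Sketch of text normalization and SE-task detection (illustrative).
# Requires: pip install nltk
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()

def normalize(text: str) -> set[str]:
    """Tokenize, lowercase, and lemmatize, keeping alphabetic tokens."""
    tokens = word_tokenize(text.lower())
    return {lemmatizer.lemmatize(tok) for tok in tokens if tok.isalpha()}

# Placeholder taxonomy: SE task -> lemmatized keyword set (invented entries)
TAXONOMY = {
    "code summarization": {"code", "summarization"},
    "defect prediction": {"defect", "prediction"},
}

def detect_tasks(card_text: str) -> list[str]:
    """Return the SE tasks whose keywords all appear in the normalized text."""
    vocab = normalize(card_text)
    return [task for task, keywords in TAXONOMY.items() if keywords <= vocab]
```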
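The agreement criterion itself is straightforward to check. The sketch below computes Cohen's kappa between hypothetical human and LLM labels using scikit-learn, mirroring the paper's threshold of kappa > 0.8 ("almost perfect" on the Landis and Koch scale); the labels are invented for illustration.

```python
# Agreement check between human annotators and the LLM (illustrative labels).
# Requires: pip install scikit-learn
from sklearn.metrics import cohen_kappa_score

human_labels = [1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0]  # 1 = SE-relevant, 0 = not
llm_labels   = [1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0]  # one disagreement (index 5)

kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"Cohen's kappa: {kappa:.2f}")                  # ~0.82 for these labels
if kappa > 0.8:
    print("Almost perfect agreement; LLM may validate the full dataset.")
```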
The outcome of this work is SEMODS, a validated dataset of 3,427 SE models extracted from Hugging Face, catalogued by SE activity and task across the Software Development Lifecycle. A significant contribution is the standardized representation of benchmarks and metrics, which harmonizes the inconsistent evaluation reporting found in the original model cards. The current dataset covers 100 of the 147 defined SE tasks, and coverage is expected to grow as the automated cataloguing pipeline continues to run daily. Analysis of SEMODS reveals a specialized scope: its overlap with generalist datasets is limited (7.12% with PeaTMOSS and 28.89% with HFCommunity), indicating a unique focus on specialized and emerging SE models.
SEMODS opens up numerous opportunities for researchers and practitioners in AI for SE. The dataset supports quantitative and qualitative analyses, enabling characterization of the ML ecosystem from an SE perspective, including trends in model creation and reuse and correlations between performance, size, and popularity. It facilitates efficient model discovery, allowing users to query models by specific SE activity and task and thereby reducing manual exploration (see the sketch below). The standardized evaluation results enable empirical studies on pre-trained model (PTM) performance, cross-benchmark comparisons, and the identification of areas lacking systematic evaluation. Furthermore, SEMODS aids in identifying suitable candidate models for adaptation techniques such as fine-tuning and transfer learning, potentially reducing the cost and time of developing SE-specific models. Future work includes expanding the dataset with additional sources and developing a recommender system.
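As a hypothetical usage sketch of such model discovery, assuming the dataset were exported to a tabular file with columns like model_id, se_task, and downloads (assumed names, not the published schema), a query by SE task could look like this:

```python
# Hypothetical query over a tabular export of SEMODS; file and column
# names are assumptions, not the published schema.
import pandas as pd

df = pd.read_csv("semods.csv")
candidates = (
    df[df["se_task"] == "code summarization"]
      .sort_values("downloads", ascending=False)
      .head(10)
)
print(candidates[["model_id", "se_task", "downloads"]])
```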