AI Summary • Published on Dec 18, 2025
Medical imaging AI competitions, while crucial for advancing the field, may not provide sufficiently representative, accessible, and reusable data to support clinically meaningful AI. This systematic study assessed fairness along two dimensions: the representativeness of challenge datasets with respect to real-world clinical diversity, and their accessibility and legal reusability in line with the FAIR principles. The authors note that despite the potential of AI in medicine, few algorithms reach clinical use, partly due to a lack of rigorous, real-world validation, which makes fair benchmarking critical.
The researchers conducted a large-scale systematic review of 241 biomedical image analysis challenges, encompassing 458 tasks across 19 imaging modalities, organized between 2018 and 2023. Data were collected from challenge websites and associated publications. Two independent observers screened each challenge against predefined parameters covering general information, data access and licenses, and data quality; a third, senior screener resolved conflicts to ensure consistent interpretation. The study focused specifically on the Accessibility and Reusability aspects of the FAIR principles, examining practical barriers rather than legal or normative standards.
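The paper does not publish its screening tooling; the minimal Python sketch below only illustrates how such a dual-observer extraction with senior adjudication could be recorded and resolved, and all field and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical record of one screening parameter for one challenge task;
# field names are illustrative, not taken from the paper.
@dataclass
class ScreeningItem:
    task_id: str
    parameter: str                      # e.g. "license_type", "ethics_approval"
    observer_a: str                     # value extracted by the first observer
    observer_b: str                     # value extracted by the second observer
    adjudicated: Optional[str] = None   # value set by the senior screener

def resolve(item: ScreeningItem) -> str:
    """Return the agreed value; flag conflicts that still need senior review."""
    if item.observer_a == item.observer_b:
        return item.observer_a
    if item.adjudicated is None:
        raise ValueError(f"{item.task_id}/{item.parameter}: conflict needs senior review")
    return item.adjudicated

# Example: the two observers disagree on the license field, the senior screener decides.
item = ScreeningItem("task_001", "license_type", "CC BY-NC", "CC BY",
                     adjudicated="CC BY-NC")
print(resolve(item))  # -> "CC BY-NC"
```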
The study uncovered substantial biases in dataset composition: geographic (70% US data, with minimal representation from other continents), modality (MRI and CT dominated at 35% and 21% respectively, while other common clinical modalities were rare), and problem type (segmentation, classification, and detection comprised the majority). Most data came from a few centers and devices. Furthermore, although 81% of datasets were nominally public, accessibility was often hindered by mandatory registration (38%), organizer approval (20%), or context limitations. Licensing practices showed widespread issues: only 59% provided detailed license information, and only 41% used unambiguous licenses. Restrictive licenses dominated (44%), and 80% of tasks showed some form of inconsistency in data licensing or access practices, spanning unclear or borderline cases (43%), inconsistent or misleading cases (20%), and potentially non-compliant cases (38%). Critical documentation gaps were also prevalent, with missing information on acquisition devices (43%), ethics approval (35%), study population (41%), and case selection criteria (49%), and with 60% of annotation protocols incomplete.
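To make the inconsistency categories concrete, the following minimal Python sketch shows how a single task might be assigned one of these flags; the rules and field names are illustrative assumptions for demonstration only, not the paper's actual assessment criteria.

```python
from typing import Optional

# Illustrative heuristic only: the study applies more detailed assessment
# criteria; the rules and field names below are assumptions.
def licensing_flag(stated_license: Optional[str],
                   shipped_license: Optional[str],
                   advertised_public: bool,
                   requires_approval: bool) -> str:
    """Assign a coarse licensing/access consistency flag to one challenge task."""
    if not stated_license or stated_license.lower() in {"unknown", "custom"}:
        return "unclear/borderline"         # no unambiguous license information
    if shipped_license and shipped_license != stated_license:
        return "inconsistent/misleading"    # website and distributed data disagree
    if advertised_public and requires_approval:
        return "potentially non-compliant"  # advertised as public but effectively gated
    return "consistent"

# Example: the challenge website states CC BY-NC 4.0, but the downloaded data
# ships with CC BY 4.0 -- flagged as inconsistent/misleading.
print(licensing_flag("CC BY-NC 4.0", "CC BY 4.0",
                     advertised_public=True, requires_approval=False))
```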
The findings suggest that current medical imaging AI benchmarks do not adequately reflect real-world clinical diversity and possess significant fairness limitations, hindering the generalizability and clinical translation of AI models. The observed biases, such as MRI-centric benchmarks, implicitly favor resource-rich healthcare systems, while task distributions often reflect research conventions over clinical priorities. Restrictive or ambiguous data access and licensing conditions actively impede legitimate reuse and slow scientific progress, raising legal risks and undermining reproducibility. Incomplete documentation further obstructs reusability and meaningful comparisons. The paper concludes that current benchmarking practices may be misaligned with real-world application needs, questioning whether leaderboard success truly equates to clinical readiness. To address these issues, the authors propose integrating emerging standards for transparency, provenance, and legal certainty, such as machine-readable metadata and clear, standardized licensing information, into challenge requirements.
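As a minimal sketch of what such machine-readable metadata could look like, the snippet below uses schema.org Dataset vocabulary as one possible realization; the paper does not mandate a specific schema, and all field values are placeholders.

```python
import json

# Hypothetical challenge-dataset record with an explicit, standard license;
# schema.org "Dataset" properties are used here as one possible vocabulary.
metadata = {
    "@context": "https://schema.org",
    "@type": "Dataset",
    "name": "Example Segmentation Challenge 2023 - Training Data",
    "license": "https://creativecommons.org/licenses/by/4.0/",  # unambiguous, standard license
    "isAccessibleForFree": True,
    "creator": {"@type": "Organization", "name": "Example Medical Center"},
    "measurementTechnique": "MRI",
    "description": "Multi-center abdominal MRI with expert annotations; "
                   "acquisition devices, study population, and annotation "
                   "protocol documented.",
}

print(json.dumps(metadata, indent=2))
```

Publishing records like this alongside challenge data would let participants and indexing services check license terms and access conditions automatically rather than inferring them from prose on a challenge website.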