AI Summary • Published on Feb 27, 2026
Historically, astronomical research has predominantly relied on analyzing a single type of data, which often provides only a partial view of complex cosmic phenomena. However, with major advances in observational technologies and in large-scale facilities and sky surveys such as LAMOST, SDSS, FAST, HST, VRO, and the upcoming SKA, astronomy is experiencing an unprecedented "data deluge." This influx includes diverse electromagnetic data (e.g., optical, infrared, X-ray) and non-electromagnetic data (e.g., gravitational waves, cosmic rays), presented in various modalities such as images, spectra, time series, tables, and text. While early attempts at multi-band and multi-messenger data fusion existed, they were largely limited to co-location and statistical association, failing to capture the complex, non-linear relationships inherent in high-dimensional, raw multimodal data because they lacked deep feature extraction. The rise of deep learning (DL) offers a powerful solution for automatic feature extraction and for modeling complex inter-modal relationships. Despite this transformative potential, DL-based multimodal data fusion (MDF) has not yet been widely adopted in astronomy, motivating a timely and comprehensive review of this emerging field.
This review systematically outlines the landscape of deep learning-based astronomical multimodal data fusion. It categorizes and critically analyzes six representative deep learning architectures commonly employed in astronomical MDF studies: Artificial Neural Networks (ANNs), Convolutional Neural Networks (CNNs), Autoencoders (AEs), Generative Adversarial Networks (GANs), Recurrent Neural Networks (RNNs), and the Transformer architecture. For each model, the paper discusses its mechanisms, advantages, applicability, and limitations within the context of astronomical multimodal data processing.
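To make the catalogue concrete, the sketch below shows two of these architectures applied to typical astronomical modalities: a small CNN that encodes single-band image cutouts and a Transformer encoder that summarizes light-curve segments. It is a minimal PyTorch illustration only; the input shapes, layer sizes, and class names are assumptions for demonstration and are not drawn from the review.

```python
import torch
import torch.nn as nn

class ImageCNN(nn.Module):
    """Small CNN mapping a single-band image cutout to a feature vector."""
    def __init__(self, out_dim: int = 64):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),           # global pooling -> (B, 32, 1, 1)
        )
        self.fc = nn.Linear(32, out_dim)

    def forward(self, x):                       # x: (B, 1, H, W)
        return self.fc(self.conv(x).flatten(1))

class LightCurveTransformer(nn.Module):
    """Transformer encoder mapping a light-curve sequence to a feature vector."""
    def __init__(self, d_model: int = 32, out_dim: int = 64):
        super().__init__()
        self.embed = nn.Linear(1, d_model)      # each time step carries one flux value
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.fc = nn.Linear(d_model, out_dim)

    def forward(self, x):                       # x: (B, T, 1)
        h = self.encoder(self.embed(x))         # (B, T, d_model)
        return self.fc(h.mean(dim=1))           # average over time steps

# Hypothetical shapes: a batch of 8 cutouts (64x64 pixels) and light curves of 200 epochs.
img_feat = ImageCNN()(torch.randn(8, 1, 64, 64))            # (8, 64)
lc_feat  = LightCurveTransformer()(torch.randn(8, 200, 1))  # (8, 64)
```

Any of the other surveyed architectures (e.g., an autoencoder or RNN) could stand in as a modality-specific encoder in the same way.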
The review further details four primary fusion strategies that dictate how information from different modalities is integrated: data-level fusion (early fusion), feature-level fusion (intermediate-layer fusion), decision-level fusion (late fusion), and hybrid fusion. Data-level fusion directly concatenates raw data before model input; it suits highly correlated modalities but is sensitive to heterogeneity among data types. Feature-level fusion, the most prevalent strategy, extracts modality-specific features and then combines them into a unified representation using methods such as concatenation, weighted summation, or attention mechanisms, emphasizing complementary information. Decision-level fusion combines the outputs (e.g., predictions) of independently processed unimodal models, offering flexibility for asynchronous data. Hybrid fusion dynamically integrates multiple fusion levels to leverage the strengths of each. The paper also outlines the general multimodal model development process, encompassing data preprocessing (cleaning, normalization, alignment, augmentation), model pre-training, model fine-tuning, and rigorous evaluation using appropriate metrics.
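As a concrete illustration of the first three strategies, the following minimal PyTorch sketch contrasts data-level, feature-level, and decision-level fusion for a hypothetical two-modality classification task (a spectrum plus catalog features); the dimensions, encoders, and equal decision weights are illustrative assumptions, not the review's implementation.

```python
import torch
import torch.nn as nn

# Two hypothetical modalities for one source: a 128-pixel spectrum and 10 catalog features.
spec  = torch.randn(8, 128)   # batch of 8 spectra
table = torch.randn(8, 10)    # matching tabular (catalog) rows
n_classes = 3

# Data-level (early) fusion: concatenate raw inputs, then feed a single model.
early = nn.Sequential(nn.Linear(128 + 10, 64), nn.ReLU(), nn.Linear(64, n_classes))
logits_early = early(torch.cat([spec, table], dim=1))

# Feature-level (intermediate) fusion: per-modality encoders, fuse their embeddings.
enc_spec  = nn.Sequential(nn.Linear(128, 32), nn.ReLU())
enc_table = nn.Sequential(nn.Linear(10, 32), nn.ReLU())
fusion_head = nn.Linear(32 + 32, n_classes)
fused = torch.cat([enc_spec(spec), enc_table(table)], dim=1)   # simple concatenation
logits_feature = fusion_head(fused)

# Decision-level (late) fusion: independent unimodal classifiers, combine their outputs.
clf_spec  = nn.Sequential(nn.Linear(128, 32), nn.ReLU(), nn.Linear(32, n_classes))
clf_table = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, n_classes))
probs_late = 0.5 * clf_spec(spec).softmax(dim=1) + 0.5 * clf_table(table).softmax(dim=1)
```

In practice, the feature-level variant is where weighted summation or attention mechanisms would replace the simple concatenation shown here.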
The review highlights several key findings regarding the current state of astronomical multimodal data fusion research. The concept of "multimodality" in astronomy is interpreted flexibly, extending beyond traditional multimedia types to encompass variations even within the same medium, such as images from different wavebands or spectra from different observatories. Since 2023, there has been a significant surge in astronomical multimodal studies, largely driven by advances in deep learning, the popularization of large language models, and the inherent demand for integrating diverse astronomical data, particularly through multi-band and multi-messenger paradigms. Feature-level fusion is overwhelmingly the dominant strategy, employed in over 93% of studies, underscoring its effectiveness in capturing deep cross-modal interactions; decision-level fusion, by contrast, is largely confined to specific applications such as space environment monitoring. Image data account for approximately 78% of the modalities used, making them the most prominent, while spectral, time-series, tabular, and text data are used less frequently, although text data are gradually gaining traction. Solar physics leads research activity in this field, with numerous studies, diverse data modalities, and specialized datasets, often propelled by the urgent need for space weather forecasting.
The review identifies several critical challenges and promising future research directions for advancing deep learning-based astronomical multimodal data fusion. Ensuring consistency across heterogeneous astronomical data modalities is crucial, requiring specialized modules for cross-modal calibration and uncertainty-aware normalization within fusion architectures. The field urgently needs large-scale, standardized multimodal fusion benchmarks to drive algorithmic progress, moving beyond fragmented, task-specific datasets towards community-driven data platforms. Enhancing computational efficiency and scalability is essential for processing vast, complex datasets, necessitating cloud-native fusion algorithms that leverage cloud storage and GPU-accelerated environments. Improving model interpretability is paramount for scientific rigor, shifting towards Explainable AI (XAI)-guided fusion paradigms that can elucidate how cross-modal interactions drive predictions. Mitigating data scarcity, particularly for rare cosmic phenomena, requires embedding physical priors into learning architectures and developing cross-modal completion frameworks. Finally, fostering open science practices (sharing datasets, models, and code) and encouraging interdisciplinary collaboration through challenge tasks can accelerate innovation and broaden the adoption of cutting-edge MDF technologies in astronomy.