AI Summary • Published on Jan 20, 2026
Decoding visual experiences directly from human brain activity captured by fMRI presents a significant challenge. A major hurdle is the substantial variability in cortical responses across individuals (inter-subject) and within the same person across repeated trials (intra-subject). Because of this variability, the same visual stimulus can elicit highly diverse fMRI patterns, making the mapping between fMRI signals and visual images highly non-injective. Existing fMRI-to-image reconstruction methods largely operate in a within-subject paradigm: they require extensive training data for each new individual and struggle to generalize to previously unseen subjects. The absence of a suitable large-scale dataset with both diverse subjects and diverse stimuli further complicates the development and evaluation of zero-shot cross-subject fMRI-to-image reconstruction, in which a model must reconstruct images for an individual it has never been trained on.
To address these challenges, the authors introduce two key contributions: the UniCortex-fMRI dataset and the PictorialCortex framework. UniCortex-fMRI is a unified cortical-surface fMRI dataset that integrates four heterogeneous visual-stimulus datasets (NSD, BOLD5000, NOD, HCP-Movie) into a common cortical-surface representation. It provides extensive coverage of both subjects and stimuli, enabling principled evaluation of cross-subject generalization. The PictorialCortex framework explicitly models fMRI activity with a compositional latent formulation and proceeds in three stages. First, a universal fMRI autoencoder is pretrained on a large dataset (UK Biobank) to learn a shared, compact cortical latent space that serves as a stable foundation across individuals. Second, a Latent Factorization–Composition Module (LFCM) factorizes each fMRI observation into four interpretable components within this universal latent space: a stimulus-driven factor, a subject factor, a dataset factor, and a nuisance factor. This disentanglement is reinforced by Paired Factorization and Reconstruction (PFR) and Re-Factorizing Consistency Regularization (ReFCR), which encourage robust, invariant stimulus representations while accounting for individual and experimental variability. Third, during inference for an unseen subject, surrogate latents are synthesized under multiple seen-subject conditions, re-factorized, and aggregated into a robust stimulus-driven code, which then conditions a diffusion-based generator (via IP-Adapter) to produce the final image reconstruction.
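The factorize–compose–aggregate flow above can be sketched in a few lines. This is a minimal illustrative sketch, not the authors' implementation: the linear factorization heads, additive composition, and mean aggregation are all assumptions made for clarity, and the names (`factorize`, `compose`, `robust_stimulus_code`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # dimensionality of the universal cortical latent space (illustrative)

# Hypothetical linear "factorization heads": each projects a latent z into one
# of the four compositional factors the paper describes.
FACTORS = ["stimulus", "subject", "dataset", "nuisance"]
heads = {name: rng.standard_normal((D, D)) / np.sqrt(D) for name in FACTORS}

def factorize(z):
    """Split a latent into the four components (assumed linear heads)."""
    return {name: W @ z for name, W in heads.items()}

def compose(factors):
    """Re-compose a latent from its factors (assumed additive)."""
    return sum(factors.values())

def robust_stimulus_code(z_unseen, seen_subject_factors):
    """Zero-shot inference sketch: synthesize surrogate latents for an unseen
    subject under several seen-subject conditions, re-factorize each, and
    aggregate the stimulus-driven codes."""
    codes = []
    for subj_factor in seen_subject_factors:
        f = factorize(z_unseen)
        f["subject"] = subj_factor                       # swap in a seen subject's factor
        surrogate = compose(f)                           # surrogate latent
        codes.append(factorize(surrogate)["stimulus"])   # re-factorize
    return np.mean(codes, axis=0)                        # aggregate across conditions
```

In the actual framework this stimulus-driven code would then condition the diffusion generator; here it is simply a vector whose stability across surrogate conditions stands in for the paper's aggregation step.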
Extensive experiments demonstrate that PictorialCortex significantly outperforms existing fMRI-to-image reconstruction frameworks, including NeuroPictor, MindBridge, and MindEye2, on both low-level visual fidelity and high-level perceptual consistency metrics. The model reliably reconstructs semantic content and spatial structure for unseen subjects in a zero-shot setting, unlike baselines, which often produce blurred or semantically inconsistent results. Ablation studies confirm the contribution of each component: the Latent Factorization–Composition Module (LFCM) and its compositional factors (stimulus-driven, subject, dataset, nuisance), the universal fMRI autoencoder, and the inference-stage refinements (surrogate-based re-factorizing and rescaling). Notably, removing the nuisance component led to the most severe performance drop, highlighting its importance in decoupling trial-specific noise from semantic content. Multi-dataset training on UniCortex-fMRI also yielded substantial improvements over single-dataset training, especially for datasets with high stimulus diversity but few subjects. Furthermore, performance consistently improved as the number of training subjects grew, underscoring the importance of diverse subject coverage for generalizable representations.
This work provides a principled dataset (UniCortex-fMRI) and a scalable, generalizable decoding framework (PictorialCortex) for zero-shot cross-subject fMRI-to-image reconstruction. It demonstrates the feasibility of disentangling stimulus-driven visual information from other sources of variability in fMRI signals, paving the way for more robust and transferable neurodecoding systems. The compositional latent modeling approach offers an interpretable way to study how different factors interact in neural signals, motivating future research in neural representation analysis, cross-population neuroscience, and brain-machine interfaces that demand generalization. Despite these successes, the approach has limitations, including potential error propagation from the pretrained autoencoder and its current focus on static visual perception; future work could explore joint optimization of representation learning and factor disentanglement, as well as extending the framework to temporal dynamics such as video perception.