AI Summary • Published on Apr 29, 2026
The effective integration of artificial intelligence (AI) into healthcare, particularly for applications like Computerized Cognitive Training (CCT), is challenged by the difficulty AI systems face in accurately recognizing human affective states. A significant limitation is the lack of generalizability of existing affect recognition models, especially those relying on categorical labels, across different age groups. This poses a problem for CCT, which benefits from social interaction and is applied to both older adults and younger populations, as emotional expressions can vary considerably between these demographics.
This study extended the THERADIA-WoZ corpus by collecting a new dataset from 52 young adults (mean age 21.17 years), mirroring the methodology used for the original older adult corpus. Multimodal data (audio, visual, and textual) were captured during CCT sessions with a Wizard-of-Oz virtual assistant. The researchers compared models for affect recognition based on 23 categorical labels and five appraisal dimensions (Novelty, Intrinsic Pleasantness, Goal Conduciveness, Coping, and Arousal). Both expert-based features (Mel-scale filter banks for audio, TF-IDF for text, Facial Action Units for visual) and deep learning representations (multilingual Wav2Vec2 for audio, multilingual BERT for text, CLIP for visual) were extracted. Multimodal fusion was achieved using Ordinary Least Squares (OLS) regression for labels and dimension summaries, and a GRU model for continuous dimensions. Models were evaluated using five-fold cross-validation under three training strategies: within-corpus, cross-corpus (trained on one age group, tested on the other), and mixed-corpus (trained on both, tested on each separately). Performance was measured with the Concordance Correlation Coefficient (CCC) and analyzed with Bayesian linear models.
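The core of this evaluation setup — OLS late fusion of per-modality predictions, scored with the Concordance Correlation Coefficient — can be sketched in a few lines. The synthetic targets, noise scales, and modality columns below are illustrative assumptions, not data or results from the study:

```python
import numpy as np

def ccc(y_true, y_pred):
    """Concordance Correlation Coefficient (Lin, 1989):
    2*cov / (var_true + var_pred + (mean_true - mean_pred)^2)."""
    mean_t, mean_p = y_true.mean(), y_pred.mean()
    var_t, var_p = y_true.var(), y_pred.var()
    cov = ((y_true - mean_t) * (y_pred - mean_p)).mean()
    return 2 * cov / (var_t + var_p + (mean_t - mean_p) ** 2)

# Hypothetical per-sample predictions of one appraisal dimension
# (e.g. Arousal), one column per modality, simulated as the true
# score plus modality-specific noise.
rng = np.random.default_rng(0)
n = 200
target = rng.normal(size=n)                   # ground-truth dimension scores
preds = np.column_stack([
    target + rng.normal(scale=0.8, size=n),   # audio model output
    target + rng.normal(scale=0.6, size=n),   # text model output
    target + rng.normal(scale=1.0, size=n),   # visual model output
])

# OLS late fusion: learn an intercept plus one weight per modality
# in closed form via least squares.
X = np.column_stack([np.ones(n), preds])
w, *_ = np.linalg.lstsq(X, target, rcond=None)
fused = X @ w

print(f"audio-only CCC: {ccc(target, preds[:, 0]):.3f}")
print(f"fused CCC:      {ccc(target, fused):.3f}")
```

On this toy data the fused prediction scores a higher CCC than any single modality, mirroring the study's finding that multimodal fusion outperforms unimodal models; CCC, unlike Pearson correlation, also penalizes shifts in mean and scale, which is why it is the standard metric for continuous affect prediction.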
The study found that models based on appraisal dimensions consistently outperformed those using categorical labels in both predictive accuracy and stability across all evaluation conditions. Categorical labels failed to generalize across age groups, with performance dropping to chance level in cross-corpus evaluation; appraisal dimensions, in contrast, remained above chance, highlighting their robustness for cross-age affect recognition. Models also performed better when tested on the older adult corpus than on the young adult corpus. Training on a mixed corpus (both age groups) did not significantly improve generalization over within-corpus training, suggesting that simply combining age groups yields no additional benefit for model adaptation. Finally, multimodal fusion consistently outperformed unimodal models, and deep representations yielded better predictions than expert features.
These findings strongly support the theoretical and practical advantages of using appraisal dimensions over categorical labels in affective computing, especially for applications requiring cross-age generalization such as AI-assisted CCT. The robustness of appraisal dimensions offers a more stable and theoretically grounded framework for emotion modeling. The research also underscores the importance of multimodal fusion and advanced deep learning representations for accurately capturing emotional nuances. The provided API for time-continuous emotion prediction will serve as a valuable resource for behavioral sciences, enabling more refined and data-driven measurements of emotional states in various experimental settings, ultimately enhancing the effectiveness of AI systems in healthcare interactions.