AI Summary • Published on Jan 10, 2026
Personalized mobile AI applications face a fundamental challenge in balancing three conflicting requirements: immediacy (adapting to recent behavior), stability (resisting noise and supporting long-horizon prediction), and generalization (handling sparse data and cold-start users). Existing methods typically satisfy at most two of these, leading to an "impossibility triangle" in data-scarce and non-stationary mobile environments. This fragmentation often results in different tasks, such as short-term prediction, long-term forecasting, and cold-start recommendation, being treated as separate problems, despite originating from the same underlying spatio-temporal behavioral process.
The paper proposes U-MASK, a unified framework that redefines personalized mobile AI inference as a conditional completion problem on partially observed spatio-temporal tensors. U-MASK integrates three primary components:
1. U-MASK (Masking Mechanism): This user- and task-specific module generates a binary mask that explicitly defines which spatio-temporal regions of a user's behavior tensor are treated as observed evidence and which must be inferred. The mask's allocation is determined by a target observation ratio, a spatio-temporal sampling distribution that ranks coordinate utility, and a ratio-constrained sampler. It dynamically adjusts the evidence budget based on user reliability and task sensitivity, prioritizing relevant temporal and spatial segments through feature-weighted spatial affinity. For cold-start scenarios, an exploration mechanism flattens the sampling distribution to prevent overfitting to noisy history (see the mask-sampling sketch after this list).
2. U-SCOPE (Semantic User Profiling): This component infers compact, task-agnostic semantic user representations from sparse and heterogeneous app-location interaction histories. It leverages a pre-trained Large Language Model (LLM) as an amortized semantic inference engine to translate low-level mobile telemetry into structured natural-language semantic profiles, which are then encoded into fixed-dimensional embeddings. Optionally, it employs synthetic data augmentation and Direct Preference Optimization to improve robustness in data-sparse settings (see the profiling sketch after this list).
3. Shared Diffusion Transformer (DiT) Backbone: This generative architecture performs the conditional completion. It takes as input the noisy behavior tensor, the observed evidence, the generated mask, and the semantic user embedding, and iteratively denoises the tensor, reconstructing unobserved regions while keeping observed evidence unaltered (see the completion sketch below). The backbone is shared across all inference tasks, including short-term prediction, long-term forecasting, and cold-start recommendation. The entire system is trained end-to-end, so gradients from the diffusion loss optimize both the mask generator and the semantic representation learning.
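As a concrete illustration of component 1, the masking step can be pictured as ratio-constrained sampling from a temperature-scaled utility distribution over spatio-temporal coordinates. The sketch below is a minimal illustration under assumed inputs (the utility scores, observation ratio, and exploration temperature are placeholders), not the paper's implementation.

```python
import numpy as np

def generate_mask(utility, obs_ratio, exploration_temp=1.0, rng=None):
    """Ratio-constrained mask sampling sketch (illustrative, not the paper's code).

    utility          : (T, L) array of per-coordinate utility scores
                       (e.g. recency x feature-weighted spatial affinity).
    obs_ratio        : fraction of coordinates exposed as observed evidence.
    exploration_temp : >1 flattens the sampling distribution (cold-start users),
                       <=1 keeps it peaked on high-utility coordinates.
    """
    rng = np.random.default_rng() if rng is None else rng
    flat = utility.ravel()
    # Temperature-scaled softmax turns utility scores into sampling probabilities.
    logits = flat / max(exploration_temp, 1e-6)
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    # Evidence budget: how many coordinates this user/task is allowed to observe.
    budget = max(1, int(round(obs_ratio * flat.size)))
    chosen = rng.choice(flat.size, size=budget, replace=False, p=probs)
    mask = np.zeros_like(flat, dtype=bool)
    mask[chosen] = True          # True = observed evidence, False = to be inferred
    return mask.reshape(utility.shape)

# Example: a short-term prediction task gets a sparse, recency-biased mask,
# while a cold-start user gets a denser, flatter (more exploratory) one.
utility = np.linspace(0.0, 1.0, 24 * 10).reshape(24, 10)   # toy 24h x 10 locations
short_term_mask = generate_mask(utility, obs_ratio=0.15, exploration_temp=0.5)
cold_start_mask = generate_mask(utility, obs_ratio=0.40, exploration_temp=5.0)
```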
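Component 2 (U-SCOPE) can be pictured as a prompt → profile text → fixed-dimensional embedding pipeline. The sketch below is illustrative only: the PROFILE_PROMPT template, the llm_generate callable, and the sentence-transformers encoder are placeholder choices I am assuming, not the components used in the paper.

```python
from sentence_transformers import SentenceTransformer  # assumed off-the-shelf encoder

# Hypothetical prompt template; the paper's actual prompting scheme is not shown here.
PROFILE_PROMPT = """You are given a user's app-location interaction log.
Summarise the user's routines, preferred app categories, and typical
locations as a short structured profile.

Log:
{log}
"""

def build_semantic_profile(records, llm_generate, encoder=None):
    """U-SCOPE-style profiling sketch (illustrative; names are placeholders).

    records      : list of (timestamp, app, location) tuples from telemetry.
    llm_generate : callable str -> str wrapping whatever LLM is available.
    encoder      : sentence encoder producing a fixed-dimensional embedding.
    """
    log = "\n".join(f"{t} | app={a} | loc={l}" for t, a, l in records)
    profile_text = llm_generate(PROFILE_PROMPT.format(log=log))
    encoder = encoder or SentenceTransformer("all-MiniLM-L6-v2")
    # The fixed-dimensional embedding conditions the downstream DiT backbone.
    return profile_text, encoder.encode(profile_text)
```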
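Component 3's conditional completion resembles inpainting-style diffusion sampling, where observed entries are re-imposed at every denoising step so only unobserved regions are generated. The sketch below abstracts the shared DiT as a denoiser callable and uses a toy noise schedule with a simplified deterministic update; the paper's exact sampler may differ.

```python
import torch

@torch.no_grad()
def masked_completion(denoiser, evidence, mask, user_emb, steps=50):
    """Inpainting-style conditional completion sketch (not the paper's exact sampler).

    denoiser : the shared DiT, abstracted as f(x_t, t, mask, user_emb) -> eps_hat.
    evidence : behavior tensor with observed entries filled in, shape (T, L, F).
    mask     : boolean tensor of the same shape, True where entries are observed.
    user_emb : semantic user embedding from U-SCOPE.
    """
    betas = torch.linspace(1e-4, 0.02, steps)          # toy linear noise schedule
    alphas_bar = torch.cumprod(1.0 - betas, dim=0)
    x = torch.randn_like(evidence)                     # start from pure noise
    for t in reversed(range(steps)):
        a_bar = alphas_bar[t]
        # Pin observed regions to a correspondingly noised copy of the evidence,
        # so the model only has to reconstruct the unobserved regions.
        noised_evidence = a_bar.sqrt() * evidence + (1 - a_bar).sqrt() * torch.randn_like(evidence)
        x = torch.where(mask, noised_evidence, x)
        eps_hat = denoiser(x, t, mask, user_emb)
        # Simplified deterministic (DDIM-style) update toward the predicted clean tensor.
        x0_hat = (x - (1 - a_bar).sqrt() * eps_hat) / a_bar.sqrt()
        if t > 0:
            a_bar_prev = alphas_bar[t - 1]
            x = a_bar_prev.sqrt() * x0_hat + (1 - a_bar_prev).sqrt() * eps_hat
        else:
            x = x0_hat
    return torch.where(mask, evidence, x)              # observed evidence stays unaltered
```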
Experiments on seven real-world mobile datasets demonstrate that U-MASK consistently outperforms state-of-the-art methods across short-term prediction, long-horizon forecasting, and cold-start recommendation.
Multi-task Learning: Compared with conventional fixed masking strategies, U-MASK improved prediction accuracy by roughly 5% to over 90%, with the largest gains on location prediction. This highlights its ability to adapt to diverse user behaviors and task granularities.
Cold-start Scenarios: In app recommendation tasks, U-MASK achieved a Recall@5 of 0.9816, surpassing the strongest LLM-enhanced baselines by 1.8%. For location recommendation, it obtained Recall@1 of 0.9412 and NDCG@3 of 0.9706, outperforming SSM-based models. These results were achieved by building user representations and task-aware masks in real time from sparse behavioral traces, without relying on pre-existing user prompts.
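For reference, the Recall@k and NDCG@k figures above follow standard ranking-metric definitions. The sketch below shows a common single-relevant-item formulation; the paper's evaluation protocol may differ in detail.

```python
import math

def recall_at_k(ranked_items, relevant_item, k):
    """1.0 if the relevant item appears in the top-k of the ranked list, else 0.0."""
    return 1.0 if relevant_item in ranked_items[:k] else 0.0

def ndcg_at_k(ranked_items, relevant_item, k):
    """NDCG@k with a single relevant item: 1/log2(rank+2) if it is ranked in the top-k."""
    if relevant_item in ranked_items[:k]:
        rank = ranked_items.index(relevant_item)   # 0-based position in the ranking
        return 1.0 / math.log2(rank + 2)
    return 0.0

# Example: the true next location is ranked second among the model's candidates.
ranked = ["home", "office", "gym", "cafe"]
print(recall_at_k(ranked, "office", k=1))   # 0.0 (not in top-1)
print(ndcg_at_k(ranked, "office", k=3))     # 1/log2(3) ~= 0.63
```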
U-SCOPE Effectiveness: U-SCOPE's generated synthetic data accurately preserved key app-location correlations, as indicated by an R² of 0.9063 and a Pearson correlation coefficient of 0.9520. Furthermore, integrating U-SCOPE profiles into various baseline models consistently improved their performance on multivariate prediction tasks. A compact model (Llama-3.1-8B), fine-tuned on U-SCOPE's synthetic data, achieved an average accuracy of 89.33%, closely matching much larger LLMs (e.g., Qwen3-235B) in cold-start novelty recommendation scenarios.
Personalized Masking: U-MASK dynamically generated distinct masking strategies tailored to specific tasks and user behavioral patterns. For short-term prediction, sparse masks emphasized recent interactions. For long-term forecasting, moderately dense masks balanced historical coverage and noise suppression. In cold-start scenarios, dense exploratory masks were employed to maximize information extraction from limited observations. This personalized masking approach, derived from learned representations of individual dynamics, allowed U-MASK to adapt feature selection to each user's predictability and behavioral complexity.
U-MASK presents a unified and robust framework for personalized mobile AI applications by effectively addressing the inherent trade-offs between immediacy, stability, and generalization through its adaptive spatio-temporal masking mechanism. This approach enables accurate and reliable inference even in complex and data-scarce environments. Future work could explore incorporating richer multimodal spatio-temporal signals, such as satellite imagery, urban street views, and structured behavioral data, to further enhance contextual understanding and generalization in personalized mobile AI systems.