AI Summary • Published on Dec 2, 2025
Perceptual similarity scores are fundamental to both training and evaluating computer vision models, aiming to quantify image differences in a way that aligns with human vision. Traditional handcrafted metrics like SSIM are interpretable but often fail to capture complex perceptual properties, struggle to adapt to image content, or are non-differentiable, limiting their use as loss functions. Deep learning-based metrics such as LPIPS align better with human perception but rely on opaque, non-linear features from discriminative networks, leading to a lack of interpretability, unknown invariances, sensitivity to subtle perturbations, and a dependence on large annotated datasets and costly training. Existing unsupervised alternatives face similar interpretability challenges and often still require extensive data.
The Structured Uncertainty Similarity Score (SUSS) addresses these limitations by modeling each image through a set of perceptual components, such as multi-scale luminance and chrominance, each represented by a structured multivariate Normal distribution. This probabilistic approach is built on a deep generative model, a Structured Uncertainty Prediction Network (SUPN), which efficiently predicts a densely correlated multivariate Normal distribution over pixel space via a sparsely structured precision matrix. The SUPN architecture is UNet-based, with separate decoder heads predicting the mean and a sparse Cholesky decomposition of the precision matrix, and operates across multiple resolutions in the YCbCr color space. Training is self-supervised: grounded in psychophysical principles, the model learns to assign high likelihood to human-imperceptible affine and color augmentations. The final SUSS score is a weighted sum of component log-probabilities, effectively a weighted sum of Mahalanobis-like distances, with the component weights fitted to human perceptual judgments from datasets like BAPPS 2AFC. A key feature of SUSS is its interpretability, enabled by the closed-form Gaussian distribution: the learned distributions can be sampled, and "whitened residuals" can be visualized in image space. These residuals undergo an image-specific linear transformation that makes perceptually important differences more prominent while suppressing irrelevant variations.
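In concrete terms, each component contributes the log-density of a multivariate Normal whose precision matrix is expressed through its sparse Cholesky factor. The sketch below assumes the per-component means and sparse Cholesky factors have already been produced (e.g., by SUPN decoder heads); the function names and interface are illustrative, not the paper's API:

```python
import numpy as np
from scipy import sparse

def component_log_prob(y, mu, L):
    """Log-density of a flattened image component y under N(mu, (L L^T)^-1),
    where L is a sparse lower-triangular Cholesky factor of the precision
    matrix (assumed given, e.g. predicted by a SUPN decoder head)."""
    r = y - mu
    whitened = L.T @ r                         # image-specific whitening of the residual
    half_log_det = np.log(L.diagonal()).sum()  # 0.5 * log det(precision)
    n = y.shape[0]
    return half_log_det - 0.5 * whitened @ whitened - 0.5 * n * np.log(2.0 * np.pi)

def suss_score(components, weights):
    """SUSS as a weighted sum of component log-probabilities; `components`
    is a list of (y, mu, L) tuples, one per perceptual component (e.g.
    luminance/chrominance at each scale); `weights` are fitted to 2AFC data."""
    return sum(w * component_log_prob(y, mu, L)
               for w, (y, mu, L) in zip(weights, components))

# Toy usage: a 3-pixel "component" with a scaled-identity precision factor.
y, mu = np.array([0.2, 0.5, 0.7]), np.array([0.25, 0.5, 0.65])
L = sparse.eye(3, format="csc") * 2.0  # stands in for a SUPN-predicted factor
print(suss_score([(y, mu, L)], weights=[1.0]))
```

Here `whitened` is exactly the quantity visualized for interpretability: residual structure that survives the whitening transform marks perceptually meaningful differences.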
SUSS is competitive with state-of-the-art deep learning perceptual metrics and significantly outperforms traditional heuristic metrics across major benchmarks, including BAPPS 2AFC, PieAPP 2AFC, and PIPAL. Fine-tuned SUSS variants in particular achieve high overall accuracy and correlation with human judgments. The model also exhibits strong perceptual calibration, with consistently low KL divergence across the distortion types of the KADID-10k dataset, indicating that it tracks perceptual distance uniformly rather than only for particular distortions. Interpretability is validated qualitatively: whitened residuals highlight perceptually significant features, and samples from the learned distributions illustrate the range of perceptually plausible image variations. SUSS also proves effective as a loss function for downstream imaging tasks. In image reconstruction, it provides stable gradients and yields reconstructions with considerably fewer artifacts than LPIPS. For single-image super-resolution, SUSS variants produce results visually comparable to heuristic perceptual losses like SSIM, with better artifact suppression than LPIPS, suggesting it can serve as a standalone loss without additional L1/L2 regularization.
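Because the score is differentiable, it can drive gradient-based imaging tasks directly. A minimal sketch of the reconstruction setting follows, assuming a differentiable callable `suss_loss(pred, ref)` that returns a scalar to minimize (e.g. a negated SUSS score); this interface is hypothetical, standing in for whatever wrapper an implementation exposes:

```python
import torch

def reconstruct(target, suss_loss, steps=500, lr=1e-2):
    """Recover an image by descending a perceptual loss from random init.
    `suss_loss(pred, ref)` is a hypothetical differentiable scalar loss
    (e.g. negated SUSS); lower means perceptually closer to `ref`."""
    pred = torch.randn_like(target, requires_grad=True)
    opt = torch.optim.Adam([pred], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = suss_loss(pred, target)
        loss.backward()
        opt.step()
    return pred.detach()
```

The same pattern carries over to super-resolution training, where `pred` would be a network's output and the optimizer would update the network parameters instead of the image.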
SUSS represents a significant advance in perceptual image similarity: a probabilistic, interpretable, human-aligned score that also functions effectively as a loss. Its generative formulation intrinsically avoids the unknown invariances often introduced by discriminative deep models, yielding more robust and stable optimization behavior. Its inherent interpretability, through inspectable whitened residuals and samples, provides transparency into similarity judgments, which is critical for trust in sensitive applications. Strong perceptual calibration and adaptability to diverse distortions highlight its ability to learn transferable perceptual cues. While SUSS is inherently asymmetric and may require symmetrization for certain applications (a simple averaging scheme is sketched below), this is typically unproblematic in the common scenario where a fixed reference image is used. Future work will investigate scalability to larger applications, further analyze optimization stability, and explore alternative ways of capturing perceptual invariances.
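Where a symmetric score is needed, one natural construction (an illustrative choice, not prescribed by the paper) is to average the score in both directions:

$$\mathrm{SUSS}_{\mathrm{sym}}(x, y) = \tfrac{1}{2}\bigl(\mathrm{SUSS}(x, y) + \mathrm{SUSS}(y, x)\bigr),$$

which is symmetric by construction and preserves differentiability while treating each image as reference and candidate in turn.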