AI Summary • Published on Mar 12, 2026
Supervised Semantic Differential (SSD) is a method that combines quantitative and interpretive approaches to model how the meaning of text shifts in relation to continuous individual-difference variables. A core part of SSD involves applying Principal Component Analysis (PCA) to text representations before regression to reduce dimensionality, especially for smaller datasets. However, the existing SSD methodology lacks a systematic and principled way to select the optimal number of principal components (K) to retain. This absence introduces avoidable "researcher degrees of freedom," which can lead to issues such as overfitting, reduced transparency in the analysis pipeline, and potential biases in the substantive interpretation of the resulting semantic gradients.
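The PCA-then-regression core described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the helper name `ssd_gradient` and the back-projection of regression weights into embedding space are assumptions about how such a pipeline is typically wired together.

```python
# Minimal sketch of the SSD core step: reduce text embeddings with PCA,
# regress a trait score on the components, and map the coefficients back
# into the original embedding space as a "semantic gradient".
# Hypothetical helper; the paper's exact pipeline may differ.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

def ssd_gradient(embeddings, trait_scores, k):
    """Return a unit-norm gradient direction in embedding space."""
    pca = PCA(n_components=k)
    pcs = pca.fit_transform(embeddings)        # shape (n_texts, k)
    reg = LinearRegression().fit(pcs, trait_scores)
    # Back-project the k regression weights through the PCA loadings:
    gradient = reg.coef_ @ pca.components_     # shape (embedding_dim,)
    return gradient / np.linalg.norm(gradient)

# Toy usage with random data standing in for 300-d text embeddings
# (e.g. averaged GloVe vectors) and a trait scale score per participant.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 300))
y = rng.normal(size=200)
g = ssd_gradient(X, y, k=10)
```

The returned direction is what the summary calls the semantic gradient: words or texts projected onto it order from one interpretive pole to the other.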
The authors propose a PCA sweep procedure to systematically address the selection of K. This procedure frames K selection as a joint optimization problem that considers three key properties crucial for SSD as an interpretive method: the quality of the semantic representation after dimensionality reduction, the interpretability of the derived semantic gradient, and the stability of that gradient across a range of K values. The sweep evaluates a sequence of K values, performing SSD at each step and tracking diagnostic metrics. These metrics include cluster coherence (as an interpretability criterion) and gradient stability, measured by cosine differences between consecutive gradients. Instead of solely focusing on variance explained, the method prioritizes solutions with coherent and stable semantic structures. A plateau-sensitive smoothing technique is applied to both interpretability and stability curves to emphasize broad, stable plateaus over sharp spikes. Finally, a joint score is computed, combining the smoothed interpretability and stability, and the smallest K that achieves the maximal joint score is selected, favoring parsimonious, stable, and interpretable solutions. The method was demonstrated using a corpus of short posts about artificial intelligence from Prolific participants, who also completed Admiration and Rivalry narcissism scales. Text was embedded using the 300-dimensional Dolma GloVe model.
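The sweep loop above can be sketched schematically. Assumptions to note: `ssd_fit` is a hypothetical callable returning a gradient and a coherence score for a given K, and a plain moving average stands in for the paper's plateau-sensitive smoothing; stability is computed as cosine similarity between consecutive gradients, so higher is more stable.

```python
# Sketch of the K-selection sweep: evaluate SSD at each K, track
# interpretability (cluster coherence) and stability (cosine similarity
# of consecutive gradients), smooth both curves, and pick the smallest K
# that maximizes the joint score.
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def smooth(x, w=3):
    """Moving-average smoothing (stand-in for plateau-sensitive
    smoothing): favors broad, stable plateaus over sharp spikes."""
    return np.convolve(x, np.ones(w) / w, mode="same")

def pca_sweep(ssd_fit, ks):
    grads, coherence = [], []
    for k in ks:                       # ks assumed in ascending order
        g, c = ssd_fit(k)              # gradient + interpretability metric
        grads.append(g)
        coherence.append(c)
    # Stability of the first K has no predecessor; treat it as neutral.
    stability = [1.0] + [cosine(grads[i - 1], grads[i])
                         for i in range(1, len(grads))]
    joint = smooth(np.asarray(coherence)) * smooth(np.asarray(stability))
    # np.argmax returns the FIRST maximum, i.e. the smallest such K,
    # matching the parsimony preference described in the text.
    return ks[int(np.argmax(joint))], joint

# Toy usage: a fake ssd_fit whose gradients and coherence stabilize
# once K reaches 5, so the sweep should settle in that plateau.
ks = list(range(2, 10))
rng = np.random.default_rng(1)
base = rng.normal(size=300)
def fake_fit(k):
    g = base + (0.5 if k < 5 else 0.0) * rng.normal(size=300)
    return g, (1.0 if k >= 5 else 0.3)
best_k, joint = pca_sweep(fake_fit, ks)
```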
When applied to the case study, the PCA sweep procedure identified an optimal K for the Admiration (ADM) narcissism scale. The ADM model explained a small-to-moderate but reliable proportion of variance (adjusted R² = .19, p < 10⁻¹¹), yielding a pronounced semantic gradient. Interpretation of this gradient revealed two distinct poles: the positive pole reflected an optimistic, collaborative, and prosocial framing of AI, associating it with innovation, integration, and constructive technological progress; the negative pole emphasized distrust, antagonism, and derision, portraying AI and its creators as deceptive, biased, or ideologically motivated. In contrast, the Rivalry (RIV) model did not reach statistical significance. Finally, a counterfactual analysis at an arbitrarily high dimensionality (K=120) produced semantically diffuse, hard-to-interpret clusters that lacked the coherence and meaningful structure observed with the sweep-selected K, reinforcing the value of the proposed systematic approach.
The PCA sweep procedure strengthens Supervised Semantic Differential (SSD) by providing a transparent, stability-aware method for dimensionality selection, which in turn reduces researcher degrees of freedom. This approach helps preserve SSD's qualitative aims by discouraging over-parameterized representations that do not reflect psychologically meaningful structure. Beyond the methodological contribution, the case study revealed a psychologically interpretable association between Admiration and the semantics of AI discourse, consistent with existing theories of narcissism. This demonstrates how SSD, when paired with a principled choice of PCA dimensionality, can surface stable and interpretable semantic gradients that link language use to underlying psychological dispositions. While the PCA sweep addresses one key source of flexibility, future work should develop similarly principled criteria for other upstream modeling choices, such as the selection of the base embedding model. Limitations include the relatively small, survey-elicited dataset and the use of whole-text representations, which may limit generalizability and blur finer-grained semantic distinctions. The authors also emphasize that SSD is designed for hypothesis generation and meaning exploration in psychological and social research, not for predictive or profiling applications, because it estimates weak, low-variance semantic gradients.