Our preprint From Associations to Activations investigates whether an LLM's internal semantic geometry can be recovered from its observable behavior. Across eight instruction-tuned transformers and 17.5M+ trials, we compare behavior-derived similarity structures from forced-choice and free-association paradigms to layerwise hidden-state geometry using representational similarity analysis. We find that forced-choice behavior aligns substantially more with internal representations than free association, and that behavioral similarity predicts unseen hidden-state similarities beyond lexical baselines.
In cognitive science, semantic knowledge is treated as a latent structure: we cannot observe a speaker's meaning representation directly, but we can probe it systematically through behavior.
We transfer this measurement logic to large language models. Unlike in humans, in LLMs both behavior and internal representations are observable. This creates a unique opportunity: we can systematically test how well an LLM's behavioral output reveals its internal semantic geometry. The key open question is therefore not only how model behavior compares to human behavior, but what a model's own behavior reveals about its own internal representations.
We use two classic psycholinguistic paradigms—forced choice (FC) and free association (FA)—to collect semantic relations from model behavior over a shared vocabulary of 5,000 high-frequency English nouns.
In the forced-choice paradigm, each cue word is presented together with 16 candidate words, from which the model must select exactly two words that are most semantically related to the cue. Candidate sets are constructed by a deterministic shuffle of the remaining vocabulary, yielding 313 FC trials per cue. In the free-association paradigm, the model is prompted with a single cue word and asked to generate exactly five single-word associates. This is repeated across 126 stochastic runs per cue.
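The FC trial construction can be sketched as follows. This is a minimal illustration of the counting logic only: the seeding scheme, chunking, and `fc_trials_for_cue` name are hypothetical stand-ins, not the paper's exact procedure. With a 5,000-word vocabulary, partitioning the 4,999 remaining words into sets of 16 yields ceil(4999/16) = 313 trials per cue (the last set is smaller).

```python
import math
import random

def fc_trials_for_cue(cue_idx, vocab, k=16, seed=0):
    """Partition the vocabulary minus the cue into candidate sets of k words
    via a deterministic (seeded) shuffle. Hypothetical sketch of the scheme."""
    rest = [w for i, w in enumerate(vocab) if i != cue_idx]
    # Seed depends only on (seed, cue), so the shuffle is reproducible per cue.
    random.Random(seed * len(vocab) + cue_idx).shuffle(rest)
    # Chunk into consecutive candidate sets; the final set may be shorter.
    return [rest[i:i + k] for i in range(0, len(rest), k)]

vocab = [f"word{i}" for i in range(5000)]
trials = fc_trials_for_cue(0, vocab)
print(len(trials))  # 313 FC trials for this cue
```

Each trial then presents the cue alongside its 16 candidates, and the model's two selections are logged as cue–response observations.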
For each paradigm, model outputs are aggregated into a sparse cue–response count matrix $\mathbf{B}$. We reweight counts with positive pointwise mutual information (PPMI) to reduce the influence of globally frequent responses, then compute a cue–cue similarity matrix via cosine similarity between the PPMI-weighted row vectors. In total, we collected over 17.5 million trials across both paradigms and eight models.
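The PPMI-plus-cosine pipeline above can be sketched in a few lines of numpy. The tiny count matrix is made up for illustration; the helper names are ours, but the math is the standard PPMI reweighting followed by row-wise cosine similarity.

```python
import numpy as np

def ppmi(counts):
    """Positive pointwise mutual information for a cue-by-response count matrix.
    Downweights globally frequent responses; negative PMI values are clipped to 0."""
    total = counts.sum()
    row = counts.sum(axis=1, keepdims=True)   # cue marginals
    col = counts.sum(axis=0, keepdims=True)   # response marginals
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log((counts * total) / (row * col))
    pmi[~np.isfinite(pmi)] = 0.0              # zero counts -> log(0) -> clip
    return np.maximum(pmi, 0.0)

def cosine_sim(rows):
    """Pairwise cosine similarity between row vectors."""
    normed = rows / np.clip(np.linalg.norm(rows, axis=1, keepdims=True), 1e-12, None)
    return normed @ normed.T

# Toy 3-cue x 3-response count matrix B (illustrative numbers only).
B = np.array([[4., 0., 1.],
              [0., 3., 2.],
              [1., 1., 0.]])
S = cosine_sim(ppmi(B))   # 3 x 3 cue-cue similarity matrix
```

In the actual pipeline `B` is sparse (5,000 cues by many response types), so a sparse-matrix implementation would be preferable; the dense version here just shows the arithmetic.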
For each model and each word, we extract layerwise hidden-state representations under four contextual embedding strategies; these include averaging over natural contexts (Averaged), meaning-based extraction, and task-aligned extraction (e.g., Task (FC)).
Hidden-state similarity matrices are computed as cosine similarity between mean-centered layerwise word vectors.
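Concretely, for one layer this amounts to centering the word-vector matrix across words before taking pairwise cosines. A minimal sketch (the function name and toy data are ours):

```python
import numpy as np

def hidden_state_similarity(H):
    """H: (n_words, d) hidden-state vectors for one layer.
    Mean-center across words, then compute pairwise cosine similarity."""
    Hc = H - H.mean(axis=0, keepdims=True)
    Hc = Hc / np.clip(np.linalg.norm(Hc, axis=1, keepdims=True), 1e-12, None)
    return Hc @ Hc.T

rng = np.random.default_rng(0)
H = rng.normal(size=(5, 8))         # toy stand-in for one layer's word vectors
S = hidden_state_similarity(H)      # 5 x 5 similarity matrix
```

Mean-centering removes the shared component of the layer's embeddings, so the cosine reflects relative geometry rather than a common offset.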
We evaluate eight instruction-tuned decoder-only transformer models ranging from 7B to 14B parameters (Falcon3, Gemma-2, Llama-3.1, Mistral-7B, Mistral-Nemo, Phi-4, Qwen2.5, and rnj-1). Beyond behavioral embeddings, we compare hidden-state similarities to three baselines, including FastText (static word vectors) and a cross-model consensus (similarity structure shared across the other models).
We use three complementary evaluation methods:
Representational Similarity Analysis (RSA): We correlate behavior-derived and hidden-state similarity matrices over their vectorized upper triangles.
Nearest-neighbor overlap ($\mathrm{NN@}k$): We quantify how well the $k$-nearest-neighbor neighborhoods induced by hidden-state similarity match those of the reference spaces.
Held-out-words ridge regression: We test whether behavioral similarity predicts unseen hidden-state similarities on held-out words beyond lexical baselines and cross-model consensus.
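The first two evaluation methods can be sketched compactly. This is an illustrative numpy version (Pearson correlation over upper triangles for RSA, and a simple mean top-$k$ set overlap for $\mathrm{NN@}k$); the exact correlation measure and overlap definition in the paper may differ, and all names here are ours.

```python
import numpy as np

def rsa(S1, S2):
    """Correlate the vectorized upper triangles of two similarity matrices."""
    iu = np.triu_indices_from(S1, k=1)
    return np.corrcoef(S1[iu], S2[iu])[0, 1]

def nn_overlap(S1, S2, k=3):
    """Mean overlap of k-nearest-neighbor sets induced by each matrix."""
    def topk(S):
        S = S.copy()
        np.fill_diagonal(S, -np.inf)          # a word is not its own neighbor
        return np.argsort(-S, axis=1)[:, :k]  # indices of k most similar words
    return np.mean([len(set(a) & set(b)) / k
                    for a, b in zip(topk(S1), topk(S2))])

# Toy check: a similarity matrix lightly perturbed by noise stays well aligned.
rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))
S1 = A @ A.T
S2 = (lambda M: (M + M.T) / 2)(S1 + 0.01 * rng.normal(size=S1.shape))
score = rsa(S1, S2)   # close to 1 for this near-identical pair
```

The ridge-regression evaluation additionally holds out words entirely, which these matrix-level scores do not capture.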
Across all models and evaluation methods, FC behavior aligns substantially more strongly with hidden-state geometry than FA. Mean FC RSA increases from $r = .346$ under Averaged extraction to $r = .463$ under Task (FC), while FA shows the same pattern at considerably lower magnitude ($r = .140$ to $r = .199$).
Task-aligned and meaning-based extraction strategies yield the strongest alignment at earlier, mid-depth layers, whereas averaging over natural contexts shifts alignment peaks to later layers.
The full model-by-model RSA comparison shows that the FC advantage holds across all eight models, though its magnitude varies.
The held-out-words ridge regression shows that behavioral similarity—especially FC—predicts unseen hidden-state similarities beyond lexical baselines and cross-model consensus. Adding behavioral FC similarity on top of the baseline improves mean test $R^2$ by $+.022$, whereas FA yields a smaller gain ($+.002$). The full model reaches mean $R^2 = .587$ (vs. $.569$ for the baseline). Peak performance reaches $R^2 = .844$ for Llama-3.1-8B-Instruct.
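The held-out-words setup can be illustrated with a closed-form ridge fit. The data below are synthetic stand-ins (behavioral-similarity features predicting a hidden-state similarity target for unseen words); the paper's actual feature construction, regularization, and cross-validation are not reproduced here.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression: w = (X'X + alpha*I)^-1 X'y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

# Hypothetical toy data: rows = words, columns = similarity features.
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 8))                 # behavioral-similarity features
w_true = rng.normal(size=8)
y = X @ w_true + 0.1 * rng.normal(size=50)   # hidden-state similarity targets

w = ridge_fit(X[:40], y[:40])                # fit on 40 "seen" words
pred = X[40:] @ w                            # predict the 10 held-out words
ss_res = np.sum((y[40:] - pred) ** 2)
ss_tot = np.sum((y[40:] - y[40:].mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot                   # test R^2 on held-out words
```

In the paper's version, the gain in test $R^2$ from adding FC features on top of the lexical and consensus baselines is what quantifies the extra information carried by behavior.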
Our findings show that structured behavior—particularly from constrained measurement paradigms like forced choice—preserves a nontrivial projection of a model’s hidden-state similarity geometry, even without access to logits or internal activations. This has implications for both interpretability research and cognitive science:
For interpretability: Behavioral probing can serve as a practical tool for understanding internal representations when only black-box access is available. The FC paradigm's controlled candidate sets concentrate observations and produce a less sparse cue–response matrix, yielding higher signal-to-noise measurements of semantic geometry.
For cognitive science: Using our fully observable language-model setup, we can subject a core assumption to rigorous empirical tests—that structured behavior is constrained by, and can therefore partially reveal, internal states. The finding that measurement protocol strongly determines recoverability (FC vs. FA) suggests that whether a behavioral task reveals internal structure is not a generic property of “behavior”: it depends critically on how responses are constrained and aggregated.
Cross-model consensus: A striking finding is the strength of cross-model consensus—similarity structure shared across the other LLMs explains a large fraction of variance in a target model's hidden-state geometry, consistent with the hypothesis of a substantial common semantic subspace.
If you find this work interesting or helpful for your research, please consider citing our paper:
@misc{schiekiera2026associations,
title={From Associations to Activations: Comparing Behavioral and
Hidden-State Semantic Geometry in {LLMs}},
author={Schiekiera, Louis and Zimmer, Max and Roux, Christophe
and Pokutta, Sebastian and G{\"u}nther, Fritz},
year={2026},
eprint={2602.00628},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2602.00628},
}