Poster Presentation


Vision Language Model Representations Predict EEG Response to Visual and Auditory Attributes in Property Verification

Poster Session E, Sunday, September 14, 11:00 am - 12:30 pm, Field House

Harshada Vinaya1, Sean Trott1, Seana Coulson1; 1University of California, San Diego

A key debate in the semantic memory literature contrasts grounded/embodied versus language-distributional sources of our semantic knowledge. In recent years, support for distributional semantics has increasingly come from studies showing that vector representations from large language models can, to a certain extent, explain human behavior and brain data in language tasks. However, the lack of computational models of embodied semantics has made it difficult to perform comparable tests of grounded accounts. An alternative is to use multimodal language models that combine text with other sensory information (e.g., vision, sound), as their embeddings are informed by the perceptual properties of real-world objects and can therefore afford quantitative predictions about semantic processing. Here we use CLIP, a vision-and-text model, and GPT2, a model similar in architecture to CLIP’s text encoder, to ask whether distributional representations informed by vision (CLIP) can explain human neural responses while controlling for contributions from purely distributional information (GPT2). We modeled EEG data collected from 19 participants in a property verification task, using linear mixed-effects regression (LMER) models to predict differences in the EEG response to visual (“red”) versus auditory (“loud”) property words. This analysis included all TRUE responses from the task (e.g., “APPLE-red”, “DYNAMITE-booms”) for visual (e.g., “red”, n=1373) and auditory (e.g., “loud”, n=1160) property words. We measured mean voltage at each electrode in successive 100 ms windows from the onset of the property word to 700 ms, and fit five LMER models for each of the seven windows. The NULL model included no lexical or semantic predictors, only interactions of scalp dimensions and modality type (auditory/visual) as fixed effects, plus item- and subject-level random effects. The remaining models progressively added predictors of interest in interaction with the scalp dimensions and modality. The second, BASE model added word frequency and number of letters. The third model built on the BASE model by adding the GPT2 cosine distance between the concept and property vectors, operationalizing the contribution of language-distributional information. Similarly, to operationalize contributions from vision, the fourth model added the CLIP cosine distance to the BASE model. The final, full model included all of the predictors. We used Akaike Information Criterion (AIC) scores for statistical model comparison, treating AIC differences of 10 or more as robust evidence for the more likely model. Model comparisons show that the BASE model improved fit over the NULL model from 200-700 ms, GPT2 improved over BASE from 0-400 and 500-700 ms, and CLIP improved over BASE from 200-700 ms. The full model improved over both the GPT2 and the CLIP models in all windows except the first (i.e., 0-100 ms). This suggests that human semantic representations in this task are informed both by the statistical regularities of a word and by its associated visual properties. The superior performance of GPT2 in the first 100 ms may support the proposal that early word processing is sensitive to purely distributional information.
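
To make the distance predictors concrete, the following is a minimal Python sketch of how cosine distances between a concept word and a property word could be derived from CLIP's text encoder and from GPT2. The specific Hugging Face checkpoints (openai/clip-vit-base-patch32, gpt2), the mean-pooling over GPT2 token states, and the helper names are illustrative assumptions, not details reported in the abstract.

import torch
from transformers import CLIPModel, CLIPTokenizer, GPT2Model, GPT2Tokenizer

# Assumed checkpoints; the abstract does not specify which model variants were used.
clip_name = "openai/clip-vit-base-patch32"
gpt2_name = "gpt2"

clip_tok = CLIPTokenizer.from_pretrained(clip_name)
clip_model = CLIPModel.from_pretrained(clip_name).eval()
gpt2_tok = GPT2Tokenizer.from_pretrained(gpt2_name)
gpt2_model = GPT2Model.from_pretrained(gpt2_name).eval()

def clip_text_embedding(word: str) -> torch.Tensor:
    # Projected text features from CLIP's text encoder for a single word.
    inputs = clip_tok([word], return_tensors="pt")
    with torch.no_grad():
        return clip_model.get_text_features(**inputs)[0]

def gpt2_embedding(word: str) -> torch.Tensor:
    # Mean-pooled final-layer hidden states from GPT2 for a single word.
    inputs = gpt2_tok(word, return_tensors="pt")
    with torch.no_grad():
        hidden = gpt2_model(**inputs).last_hidden_state  # shape: (1, tokens, dim)
    return hidden.mean(dim=1)[0]

def cosine_distance(a: torch.Tensor, b: torch.Tensor) -> float:
    # 1 - cosine similarity: the kind of distance entered as a predictor in the LMER models.
    return 1.0 - torch.nn.functional.cosine_similarity(a, b, dim=0).item()

# Example trial from the abstract: concept "APPLE" with the visual property word "red".
print("CLIP distance:", cosine_distance(clip_text_embedding("apple"),
                                         clip_text_embedding("red")))
print("GPT2 distance:", cosine_distance(gpt2_embedding("apple"),
                                         gpt2_embedding("red")))

In the abstract's analysis, distances like these would then be added to the BASE, GPT2, CLIP, and full LMER models and compared by AIC; that model-fitting step (done with LMER) is not shown here.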

Topic Areas: Meaning: Lexical Semantics
