Contrasting Audio and Text Transformers Reveals Unique Paralinguistic Maps in Human Cortex
Poster Session E, Sunday, September 14, 11:00 am - 12:30 pm, Field House
Subha Nawer Pushpita1, Leila Wehbe1; 1Carnegie Mellon University
The availability of new large naturalistic fMRI datasets now lets us investigate how language is processed in a realistic setting where other modalities, such as visual or auditory input, are perceived simultaneously. In this work, we tease apart linguistic and paralinguistic information (emotion, prosody, speaker identity, conversational turn-taking, non-speech sound effects) in continuous speech and analyze the contribution of each kind of content to brain activity by exploiting two sibling models built on an architecturally identical 7B-parameter LLM: Qwen2-7B-Instruct and Qwen2-Audio-7B-Instruct. Although both models share the same 7B-parameter decoder, Qwen2-Audio-7B-Instruct adds an audio encoder and undergoes an extra round of audio-language alignment training. Its curriculum includes automatic speech recognition, speech-to-text translation, emotion and speaker recognition, sound-event detection, vocal sound classification, and music captioning, giving the model broad auditory exposure. Training spans 30-plus multilingual datasets, each framed with natural-language instructions that specify the task and target language, so the model learns to flexibly align sound and text under a prompt-based interface.
For every episode of the Courtois NeuroMod Friends dataset (six participants, whole first season; ~9.6 h of fMRI), we use the audio model (Qwen2-Audio-7B-Instruct) to transcribe the dialogue, then feed that exact transcript through the language model (Qwen2-7B-Instruct), so that both models have access to the same linguistic content while the audio model also has access to the auditory cues important for comprehending the sitcom. This procedure yields two matched feature spaces: a language space (text embeddings that capture lexical, syntactic, and semantic content) and an audio-enriched space (embeddings generated by the very same decoder after it has been jointly conditioned on a spectrogram encoder and further fine-tuned on the 30+ audio-language tasks). Because the decoder co-processes acoustic features, the audio-enriched embeddings weave in speaker identity, prosody, emotion, non-speech sound effects, and conversational turn-taking cues alongside the lexical, syntactic, and semantic information. After anatomically and functionally aligning all brains, we build voxel-wise ridge encoding models for (i) each feature space and (ii) their conjunction, and perform variance partitioning to estimate the unique variance explained by each space.
The joint model predicts activity across a bilateral, symmetric network encompassing superior temporal cortex, inferior and medial frontal areas, high-level visual cortex, the temporo-parietal junction, and the precuneus. Crucially, the language features explain no unique variance, whereas the audio-enriched features uniquely predict responses in large swathes of temporal cortex and in posterior regions (high-level visual cortex, the temporo-parietal junction, and the precuneus). These results suggest two hypotheses: (a) paralinguistic cues (speaker, affect, prosody) are represented independently of lexical content in both language-related regions and other association regions, and (b) such cues may either provide richer scene-level context or covary with visual or other information processed in the same areas. These hypotheses are not mutually exclusive.
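To make the matched-feature construction concrete, the sketch below shows one way the two embedding spaces could be extracted through the Hugging Face transformers interface. The model names are the ones used above; the prompt format, chunking, and mean-pooling choices are illustrative assumptions, not the authors' exact pipeline.

```python
# Hypothetical sketch: per-chunk embeddings from the two sibling Qwen2 models.
# Only the model identifiers come from the abstract; pooling and prompt
# construction are assumptions for illustration.
import librosa
import torch
from transformers import (AutoModelForCausalLM, AutoProcessor, AutoTokenizer,
                          Qwen2AudioForConditionalGeneration)

AUDIO_ID = "Qwen/Qwen2-Audio-7B-Instruct"
TEXT_ID = "Qwen/Qwen2-7B-Instruct"

audio_proc = AutoProcessor.from_pretrained(AUDIO_ID)
audio_model = Qwen2AudioForConditionalGeneration.from_pretrained(
    AUDIO_ID, torch_dtype=torch.float16, device_map="auto")

text_tok = AutoTokenizer.from_pretrained(TEXT_ID)
text_model = AutoModelForCausalLM.from_pretrained(
    TEXT_ID, torch_dtype=torch.float16, device_map="auto")


def audio_enriched_embedding(wav_path: str, transcript: str) -> torch.Tensor:
    """Decoder states conditioned on both the waveform and its transcript."""
    sr = audio_proc.feature_extractor.sampling_rate  # 16 kHz for Qwen2-Audio
    wav, _ = librosa.load(wav_path, sr=sr)
    prompt = f"<|audio_bos|><|AUDIO|><|audio_eos|>{transcript}"
    inputs = audio_proc(text=prompt, audios=wav,
                        return_tensors="pt").to(audio_model.device)
    with torch.no_grad():
        out = audio_model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1].mean(dim=1).squeeze(0)  # mean-pool over tokens


def language_embedding(transcript: str) -> torch.Tensor:
    """The same transcript through the text-only decoder, pooled the same way."""
    inputs = text_tok(transcript, return_tensors="pt").to(text_model.device)
    with torch.no_grad():
        out = text_model(**inputs, output_hidden_states=True)
    return out.hidden_states[-1].mean(dim=1).squeeze(0)
```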
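Likewise, a minimal sketch of the voxel-wise ridge encoding and variance-partitioning step, assuming delayed feature matrices and a BOLD response matrix have already been built; the cross-validation scheme and regularization grid are placeholders rather than the exact analysis settings.

```python
# Minimal sketch of the encoding + variance-partitioning analysis
# (placeholder data shapes and hyperparameters).
import numpy as np
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import KFold


def cv_r2(X: np.ndarray, Y: np.ndarray, n_splits: int = 5) -> np.ndarray:
    """Cross-validated R^2 per voxel for one feature space (or a concatenation)."""
    r2 = np.zeros(Y.shape[1])
    for train, test in KFold(n_splits=n_splits).split(X):
        model = RidgeCV(alphas=np.logspace(0, 4, 9)).fit(X[train], Y[train])
        pred = model.predict(X[test])
        ss_res = ((Y[test] - pred) ** 2).sum(axis=0)
        ss_tot = ((Y[test] - Y[test].mean(axis=0)) ** 2).sum(axis=0)
        r2 += 1.0 - ss_res / ss_tot
    return r2 / n_splits


# X_text, X_audio: (n_TRs, n_features) delayed feature matrices for the two
# spaces; Y: (n_TRs, n_voxels) BOLD responses. All three are hypothetical here.
# r2_text  = cv_r2(X_text, Y)
# r2_audio = cv_r2(X_audio, Y)
# r2_joint = cv_r2(np.hstack([X_text, X_audio]), Y)   # conjunction model
# unique_audio = r2_joint - r2_text    # variance only audio features explain
# unique_text  = r2_joint - r2_audio   # variance only text features explain
# shared       = r2_text + r2_audio - r2_joint
```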
Our matched‑backbone, transcript‑controlled framework delivers a clean dissociation of “who is speaking” and “how they sound” from “what they say” in natural dialogue, and offers a general recipe for disentangling modality‑specific versus modality‑general signals in multimodal brain datasets.
Topic Areas: Computational Approaches, Speech Perception