Phonetic Encoding and Sentence Predictability in Continuous Speech
Poster Session D, Saturday, September 13, 5:00 - 6:30 pm, Field House
This poster is part of the Sandbox Series.
Will Chih-Chao Chang1, Jiaxuan Li1, Xin Xie1; 1University of California, Irvine
Listeners readily use fine-grained phonetic detail—such as millisecond differences in voice onset time (VOT)—to distinguish between isolated words (e.g., GOLD vs. COLD; McMurray et al., 2002). However, in everyday speech, words typically unfold within continuous sentences. While previous work agrees that top-down predictions shape phonetic encoding during sentence comprehension, accounts diverge regarding the exact mechanism. One possibility is that listeners leverage prior information from high-predictability (HP) sentential context and attend less to the actual phonetic input, resulting in less veridical phonetic encoding (Manker, 2019). Alternatively, HP context helps listeners pre-activate lower-level representations, making the phonetic encoding of upcoming words more accurate (Broderick et al., 2019). To test these possibilities, this study investigates whether and how listeners dynamically adjust the precision of phonetic encoding based on sentence predictability as speech unfolds. We first manipulated English words with word-initial voiced stops (e.g., /g/ in GOLD) to create three levels of VOT ambiguity—No, Some, and Maximal—pre-selected based on a norming experiment using synthesized VOT continua of minimal pairs contrasting in voicing (e.g., GOLD-COLD). Selected VOT tokens were then spliced into audio sentences with high-predictability (e.g., The treasure hunter found a GOLD necklace) or low-predictability (e.g., The young man found a GOLD necklace) context, with target-word surprisal estimated using GPT-2. Sixty-one native English speakers recruited from Prolific completed a cross-modal identity priming task. Participants listened to auditory sentences while making lexical decisions on visual strings presented at the offset of the voiced target word within each sentence. There were 72 critical trials—equally distributed across three types of real-word visual strings: identical voiced targets (e.g., GOLD), voiceless competitors (e.g., COLD), and unrelated controls (e.g., RULE)—and 144 filler trials, which included 36 real-word and 108 non-word visual strings. Participants were instructed to respond as quickly as possible while the auditory sentence continued. Comprehension questions were randomly interleaved to ensure attention to the auditory sentences. A GLMM (3 Visual Word Type × 3 VOT Ambiguity Level × Target Word Surprisal) on the reaction times (RTs) of critical trials revealed a significant identity priming effect, with faster RTs for identical vs. unrelated visual words (β = -142.27, p < .001). RTs for identical words were slower with (1) increased VOT ambiguity (Some vs. No: β = 107.31, p < .05) and (2) higher target-word surprisal (β = 10.16, p < .05). Crucially, a three-way interaction showed that the slowing effect of VOT ambiguity (Some vs. No) on identical words was attenuated as surprisal increased (β = -13.71, p < .05). These findings suggest that listeners are more sensitive to fine-grained phonetic detail in HP context, supporting the pre-activation hypothesis that top-down predictions facilitate phonetic processing. Nonetheless, contrary to this hypothesis, (1) RTs for identical words were unaffected by Maximal VOT ambiguity, and (2) RTs for competitor words did not decrease with greater VOT ambiguity, which should have enhanced activation of voiceless competitors. Overall, these results extend prior work on spoken word recognition and call for attention to the interplay between bottom-up and top-down influences on the perception of continuous speech.
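For readers who want a concrete sense of the surprisal measure, below is a minimal sketch of how target-word surprisal can be computed with GPT-2 using the public Hugging Face "gpt2" checkpoint and the transformers/PyTorch libraries. The abstract does not describe the authors' exact pipeline, so the helper function and its details are illustrative assumptions; only the example sentences come from the abstract.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def word_surprisal(context, target):
    """Hypothetical helper: surprisal (in bits) of `target` given the left `context`."""
    context_ids = tokenizer.encode(context)
    target_ids = tokenizer.encode(" " + target)  # leading space marks a word boundary in GPT-2's BPE
    input_ids = torch.tensor([context_ids + target_ids])
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    # Sum negative log-probabilities over the target's sub-tokens;
    # logits at position i predict the token at position i + 1.
    nats = -sum(log_probs[0, len(context_ids) + i - 1, tok].item()
                for i, tok in enumerate(target_ids))
    return nats / float(torch.log(torch.tensor(2.0)))  # convert nats to bits

# Example items from the abstract: high- vs. low-predictability contexts for "gold".
print(word_surprisal("The treasure hunter found a", "gold"))
print(word_surprisal("The young man found a", "gold"))
```

Likewise, a hedged sketch of the reported 3 × 3 × continuous analysis is given below, using statsmodels' linear mixed-model formula interface as a stand-in for the full GLMM. The abstract does not report the random-effects structure or link function, so the simulated placeholder data, column names, Gaussian model of raw RTs, and by-participant random intercepts are all illustrative assumptions, not the authors' specification.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated placeholder data: trial-level RTs with the three manipulated predictors.
rng = np.random.default_rng(1)
n_trials = 61 * 72  # 61 participants x 72 critical trials
df = pd.DataFrame({
    "rt": rng.normal(700, 120, n_trials),
    "word_type": rng.choice(["identical", "competitor", "unrelated"], n_trials),
    "vot_ambiguity": rng.choice(["No", "Some", "Maximal"], n_trials),
    "surprisal": rng.uniform(1, 15, n_trials),
    "participant": np.repeat(np.arange(61), 72),
})

# Full three-way interaction of the fixed effects, random intercepts by participant.
model = smf.mixedlm("rt ~ word_type * vot_ambiguity * surprisal",
                    df, groups=df["participant"]).fit()
print(model.summary())
```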
Topic Areas: Speech Perception