Poster Presentation

How do artificial neural networks (ANNs) respond to audiovisual illusions such as the McGurk effect?

Poster Session D, Saturday, September 13, 5:00 - 6:30 pm, Field House

Haotian Ma1, Zhengjia Wang1, John F. Magnotti1, Xiang Zhang1, Michael S. Beauchamp1; 1Department of Neurosurgery, University of Pennsylvania

Humans perceive speech by integrating auditory information from the talker’s voice with visual information from the talker’s face. Incongruent speech provides a useful experimental technique for probing multisensory integration. For instance, in the McGurk effect, an auditory “ba” paired with a visual “ga” (AbaVga) produces the illusory percept of “da”. Recently, artificial neural networks (ANNs) have made remarkable progress in reproducing human abilities and may provide a useful model of human audiovisual speech perception. This raises the question of how ANNs respond to incongruent audiovisual speech such as McGurk stimuli. To answer this question, we presented McGurk and congruent control stimuli to human observers and to the ANN Audiovisual Hidden-unit BERT (AVHuBERT), developed by Meta AI. Twenty different McGurk stimuli were tested, all consisting of the same “ga” video paired with a different “ba” recording (all from the same female talker). A three-alternative forced-choice design with response choices of “ba”, “ga”, and “da” was used for both human observers and AVHuBERT. To generalize to other incongruent stimuli, we also tested a less frequently used stimulus pairing auditory “ba” with visual “fa” (AbaVfa), recorded by four different female talkers, in a two-alternative forced-choice design between “ba” and “fa”. Human observers show substantial response variability, whereas AVHuBERT is deterministic. Therefore, to model individual differences in human perception, we constructed variants of AVHuBERT by adding Gaussian noise to the weights of the linear out-projection layers in the model’s transformer encoder blocks. Across twenty AVHuBERT variants, performance was high for congruent audiovisual speech (mean accuracy of 94%). For McGurk stimuli (AbaVga), human observers reported the McGurk percept of “da” on 33% of trials; the AVHuBERT variants reported “da” on 32% of trials, a similar rate. In the AbaVfa experiment, the same twenty AVHuBERT variants reported “fa” on 83% of trials, compared with 76% for human observers. We then conducted two Kolmogorov-Smirnov tests, one on the McGurk responses and one on the “fa” responses, treating each model variant as a separate subject and asking whether the variants could be distinguished from the human observers. Neither test separated the AVHuBERT variants from the humans, suggesting that perturbing the weights with Gaussian noise effectively created human-like perceptual variability in the otherwise deterministic AVHuBERT. The similar responses of ANNs and human observers to incongruent speech stimuli suggest that ANNs may be a useful tool for interrogating the perceptual and neural mechanisms of human audiovisual speech perception.
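
The weight-perturbation step can be sketched in a few lines of PyTorch. This is a minimal illustration under stated assumptions, not the actual AVHuBERT code: a toy transformer encoder stands in for AVHuBERT, and the module path to the out-projection, the noise scale sigma, and the seeds are all hypothetical choices made for the sketch.

import copy
import torch
import torch.nn as nn

# Toy stand-in for AVHuBERT so the sketch runs on its own; the real
# model's module layout may differ.
encoder_layer = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
model = nn.TransformerEncoder(encoder_layer, num_layers=2)

def make_variant(model, sigma=0.01, seed=0):
    """Copy the model and add Gaussian noise to the weights of each
    transformer encoder block's linear out-projection layer."""
    variant = copy.deepcopy(model)
    gen = torch.Generator().manual_seed(seed)  # reproducible noise per variant
    for layer in variant.layers:               # one transformer encoder block per layer
        w = layer.self_attn.out_proj.weight    # linear out-projection weights
        with torch.no_grad():
            w.add_(sigma * torch.randn(w.shape, generator=gen))
    return variant

# Twenty variants, one per simulated observer, each with its own seed.
variants = [make_variant(model, sigma=0.01, seed=s) for s in range(20)]

Each variant is then run on the same stimuli, with its responses restricted to the forced-choice alternatives, just as for the human observers.

The group comparison can likewise be sketched with SciPy's two-sample Kolmogorov-Smirnov test, treating each human observer and each model variant as one subject. The response rates below are random placeholders, not the study's data.

import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Placeholder per-subject rates of "da" (McGurk) responses; the actual
# per-observer and per-variant rates would be used here.
human_da = rng.uniform(0.0, 1.0, size=20)  # one rate per human observer
model_da = rng.uniform(0.0, 1.0, size=20)  # one rate per AVHuBERT variant

res = ks_2samp(human_da, model_da)
print(f"KS statistic = {res.statistic:.3f}, p = {res.pvalue:.3f}")
# The same test is repeated with the "fa" rates from the AbaVfa experiment;
# a non-significant p-value means the test cannot separate variants from humans.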
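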

Topic Areas: Multisensory or Sensorimotor Integration, Computational Approaches
