Poster Presentation

Transformer-based LLM (WhisperX) vs Clinician Performance for Transcribing Speech Errors in People with Aphasia

Poster Session C, Saturday, September 13, 11:00 am - 12:30 pm, Field House

Shreya Parchure1, Harris Drachman1, Leslie Vnenchak1, Denise Harvey1, Olufunsho Faseyitan1, Roy Hamilton1, H. Branch Coslett1; 1University of Pennsylvania, Philadelphia, PA

Introduction: Speech samples from people with post-stroke aphasia (PWA) offer rich markers of cognitive and linguistic status that are essential for diagnosis and treatment. However, use of these data has been limited by the need for time- and labor-intensive manual transcription. Recent advances in artificial intelligence (AI)-based speech transcription, particularly transformer-based large language models (LLMs) such as OpenAI's Whisper, hold potential for automating these processes. Despite their utility for healthy naturalistic speech, their ability to recognize impaired speech is understudied.

Aim: Here, we examine the ability of LLMs to transcribe speech samples from the Philadelphia Naming Test (PNT) administered to PWA, compared with systematic IPA transcription by experienced speech-language pathologists (SLPs).

Methodology: We transcribed 3660 trials of PNT speech samples from 21 PWA using WhisperX in Python (model type: small.en, batch size: 32, learning rate: 5e-5, maximum sequence length: 512 tokens), a transformer-based pipeline built on OpenAI's Whisper. Each trial-level transcription was compared to that of an experienced SLP who had previously scored the PNT and was marked as a human-AI match or mismatch. Since ~56% of all PWA responses were off target, with either semantic or phonemic errors, we also classified whether the AI accurately retained these different error types in transcription. Lastly, we benchmarked AI performance at transcribing each PWA's speech by aphasia severity and type as measured by the Western Aphasia Battery.

Results: SLP transcriptions took 2-4 hours on average for each PWA (~175 trials), whereas the automated pipeline required 20-25 min per subject (2-3 min of AI transcription plus ~20 min for a human annotator to align the transcript with each trial). Compared to SLP transcription with >80% inter-rater match, the AI-assisted transcript achieved a 73.63% overall match rate (75.43% with leniency for homophones, spelling, and pluralization errors). The model correctly transcribed 96.36% of semantic errors but only 20.17% of phonemic errors. WhisperX commonly erred by introducing phonemic errors (38.65% of mismatches, i.e., substituting a phonologically similar word, e.g., "base" for "vase"); by lexicalizing erroneous utterances (29.95% of mismatches, i.e., substituting the target word even when the PWA made a phonemic error); or by skipping the trial (23.62% of mismatches).

Discussion: WhisperX reliably transcribes semantic paraphasias but is less reliable at detecting phonemic errors, likely because lexicalizing impaired speech mirrors the behavior encouraged by its training, in which the model learns to auto-correct or interpolate lower-quality recordings of healthy speech. Future models should be explicitly trained on impaired speech samples to overcome this issue. Overall, our work is the first to benchmark and implement transformer-based models for automating aphasic speech transcription, which may increase clinical and research efficiency.

Conclusion: We provide a benchmark of LLM-based real-time transcription of speech in PWA. Future algorithmic refinement for phonemic intricacies using impaired speech as input data is necessary. WhisperX offers a scalable alternative to manual transcription, potentially streamlining clinical workflows and research assessments.
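
As a rough illustration of the automated pipeline described in the Methodology, the sketch below shows how a PNT session recording might be transcribed with the open-source whisperx Python package (small.en model, batch size 32) and how a lenient trial-level match could be scored. This is a minimal sketch assuming whisperx's documented interface; the file name, normalization rules, and compare_trial helper are hypothetical and are not the authors' actual code.

    # Hypothetical sketch, not the authors' implementation.
    import re
    import whisperx

    DEVICE = "cuda"  # use "cpu" (with compute_type="int8") if no GPU is available

    # 1. Transcribe a PNT session recording with the small.en Whisper checkpoint.
    model = whisperx.load_model("small.en", DEVICE, compute_type="float16")
    audio = whisperx.load_audio("pnt_session.wav")       # illustrative file name
    result = model.transcribe(audio, batch_size=32)

    # 2. WhisperX's forced-alignment step adds word-level timestamps, which can
    #    help a human annotator map segments onto individual PNT trials.
    align_model, metadata = whisperx.load_align_model(language_code="en", device=DEVICE)
    aligned = whisperx.align(result["segments"], align_model, metadata, audio, DEVICE)

    # 3. Lenient trial-level comparison against the SLP transcription: count a
    #    match when the strings agree after lowercasing, stripping non-letters,
    #    and dropping a trailing "s" (a crude stand-in for the homophone /
    #    spelling / pluralization leniency reported in the Results).
    def normalize(word: str) -> str:
        w = re.sub(r"[^a-z]", "", word.lower())
        return w[:-1] if w.endswith("s") else w

    def compare_trial(ai_response: str, slp_response: str) -> bool:
        return normalize(ai_response) == normalize(slp_response)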

Topic Areas: Development of Resources, Software, Educational Materials, etc., Computational Approaches
