Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection
Authors: Griffin Dietz Smith*, Dianna Yee*, Jennifer King Chen, Leah Findlater
*Equal Contributors
Identifying mistakes (i.e., miscues) made while reading aloud is commonly approached post-hoc by comparing automatic speech recognition (ASR) transcriptions to the target reading text. However, post-hoc methods perform poorly when ASR inaccurately transcribes verbatim speech. To improve on current methods for reading error annotation, we propose a novel end-to-end architecture that incorporates the target reading text via prompting and is trained for both improved verbatim transcription and direct miscue detection. Our contributions include: first, demonstrating that incorporating reading text through prompting benefits verbatim transcription performance over fine-tuning, and second, showing that it is feasible to augment speech recognition tasks for end-to-end miscue detection. We conducted two case studies (children's read-aloud and adult atypical speech) and found that our proposed strategies improve verbatim transcription and miscue detection compared to the current state of the art.
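To make the post-hoc baseline concrete, the following is a minimal sketch of comparing an ASR transcript against the target reading text. It is not the authors' end-to-end method or their evaluation code: the function name `posthoc_miscues`, the label names, and the use of Python's `difflib` for word alignment are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical label names; the paper's own miscue taxonomy may differ.
LABELS = {"replace": "substitution", "delete": "omission", "insert": "insertion"}

def posthoc_miscues(target_text, transcript):
    """Align a transcript to the target text and list the spans that differ.

    Returns (label, target_words, spoken_words) tuples for every
    non-matching span in the word-level alignment.
    """
    target = target_text.lower().split()
    spoken = transcript.lower().split()
    # autojunk=False disables difflib's popularity heuristic, which could
    # otherwise ignore very frequent words in long passages.
    matcher = SequenceMatcher(a=target, b=spoken, autojunk=False)
    miscues = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            miscues.append((LABELS[tag], target[i1:i2], spoken[j1:j2]))
    return miscues

if __name__ == "__main__":
    for miscue in posthoc_miscues("the cat sat on the mat", "the cat sit on mat"):
        print(miscue)
```

The sketch also illustrates the weakness described above: if the ASR system silently "corrects" a misread word to the expected one, the transcript matches the target text and the alignment reports no miscue at all.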
May 20, 2024, research area Human-Computer Interaction, conference ACM Interaction Design and Children
March 7, 2024, research area Data Science and Annotation, research area Speech and Natural Language Processing
Podcasting has grown to be a popular and powerful medium for storytelling, news, and entertainment. Without transcripts, podcasts may be inaccessible to people who are hard-of-hearing, deaf, or deaf-blind. However, ensuring that auto-generated podcast transcripts are readable and accurate is a challenge. The text needs to accurately reflect the meaning of what was spoken and be easy to read. The Apple Podcasts catalog contains millions of podcast episodes, which we transcribe using automatic speech recognition (ASR) models. To evaluate the quality of our ASR output, we compare a small number of human-generated, or reference, transcripts to corresponding ASR transcripts.
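One common way to compare a reference transcript with an ASR transcript is word error rate (WER): the word-level edit distance divided by the number of reference words. The text above does not name the metric it uses, so treating WER as the comparison is an assumption here, and the function name `word_error_rate` is hypothetical. A minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length.

    Assumption: simple lowercasing and whitespace tokenization; a real
    pipeline would normalize punctuation and numerals as well.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)

if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

WER weights every word error equally, so it does not by itself capture how much an error harms readability or changes meaning.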