Prompting Whisper for Improved Verbatim Transcription and End-to-end Miscue Detection

Authors: Griffin Dietz Smith*, Dianna Yee*, Jennifer King Chen, Leah Findlater

*Equal Contributors

Identifying mistakes (i.e., miscues) made while reading aloud is commonly approached post-hoc by comparing automatic speech recognition (ASR) transcriptions to the target reading text. However, post-hoc methods perform poorly when the ASR system fails to transcribe speech verbatim. To improve on current methods for reading error annotation, we propose a novel end-to-end architecture that incorporates the target reading text via prompting and is trained for both improved verbatim transcription and direct miscue detection. Our contributions are twofold: first, we demonstrate that incorporating the reading text through prompting improves verbatim transcription performance over fine-tuning; second, we show that it is feasible to augment speech recognition tasks for end-to-end miscue detection. We conducted two case studies (children's read-aloud and adult atypical speech) and found that our proposed strategies improve verbatim transcription and miscue detection compared to the current state of the art.
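To make the post-hoc baseline described above concrete, the following is a minimal sketch of how an ASR transcript might be compared against the target reading text. It is not the end-to-end architecture proposed in the paper, and it is not taken from the authors' code. The function name `detect_miscues`, the miscue labels, and the whitespace tokenization are all assumptions chosen for illustration; the alignment uses Python's standard `difflib.SequenceMatcher`.

```python
from difflib import SequenceMatcher


def detect_miscues(target_text, asr_transcript):
    """Hypothetical post-hoc miscue detector (illustrative only).

    Aligns the target reading text with an ASR transcript at the word
    level and labels every non-matching span as a miscue.
    """
    # Assumption: lowercase + whitespace split is enough for this sketch;
    # punctuation and text normalization are deliberately not handled.
    target = target_text.lower().split()
    spoken = asr_transcript.lower().split()

    # Map difflib's opcode tags onto (assumed) miscue labels.
    labels = {
        "replace": "substitution",
        "delete": "omission",
        "insert": "insertion",
    }

    matcher = SequenceMatcher(a=target, b=spoken, autojunk=False)
    miscues = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # words read as written
        # Record the label, the expected words, and the transcribed words.
        miscues.append((labels[tag], target[i1:i2], spoken[j1:j2]))
    return miscues


if __name__ == "__main__":
    for miscue in detect_miscues("the cat sat on the mat", "the cat sit on mat"):
        print(miscue)
```

A comparison like this can only be as accurate as the transcript it receives: if the ASR system silently "corrects" a misread word to match the expected text, the alignment reports no miscue, which is the weakness of post-hoc methods that the abstract points to.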

Related readings and updates.

May 20, 2024. Research area: Human-Computer Interaction. Conference: ACM Interaction Design and Children.

Much of early literacy education happens at home with caretakers reading books to young children. Prior research demonstrates how having dialogue with children during co-reading can develop critical reading readiness skills, but most adult readers are unsure if and how to lead effective conversations. We present ContextQ, a tablet-based reading application to unobtrusively present auto-generated dialogic questions to caretakers to support this…

March 7, 2024. Research areas: Data Science and Annotation; Speech and Natural Language Processing.

Podcasting has grown to be a popular and powerful medium for storytelling, news, and entertainment. Without transcripts, podcasts may be inaccessible to people who are hard-of-hearing, deaf, or deaf-blind. However, ensuring that auto-generated podcast transcripts are readable and accurate is a challenge. The text needs to accurately reflect the meaning of what was spoken and be easy to read. The Apple Podcasts catalog contains millions of podcast episodes, which we transcribe using automatic speech recognition (ASR) models. To evaluate the quality of our ASR output, we compare a small number of human-generated, or reference, transcripts to corresponding ASR transcripts.

ContextQ: Generated Questions to Support Meaningful Parent-Child Dialogue While Co-Reading

Humanizing Word Error Rate for ASR Transcript Readability and Accessibility
