SpeakStream: Streaming Text-to-Speech with Interleaved Data
Authors: Richard He Bai, Zijin Gu, Tatiana Likhomanenko, Navdeep Jaitly
With the increasing integration of speech front-ends and large language models (LLMs), there is a need to explore architectures that integrate these modalities. While end-to-end models have been explored extensively, cascaded models that stream outputs from LLMs to TTS are oddly under-explored, even though they are potentially much simpler. Using traditional text-to-speech systems to convert LLM outputs to audio, however, poses a technical problem, because they need entire utterances to generate stylistic audio. In this paper we present a 'streaming' TTS that can generate audio from streaming text using a novel decoder-only architecture that interleaves text and speech. The model is trained using next-step prediction on interleaved data that is generated from forced alignment of text transcripts to speech. During inference our system processes text incrementally while generating consistent speech output, making it suitable for real-time applications like conversational AI agents where an LLM can stream text to a TTS system. Results demonstrate that our approach matches the quality of batch TTS systems while enabling streaming capabilities.
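The abstract describes training on interleaved text/speech sequences produced by forced alignment. A minimal sketch of that interleaving step, assuming a simplified alignment format of (word, speech-token) pairs; the token representation and layout are illustrative assumptions, not the paper's actual data format:

```python
# Hypothetical sketch: build one flat interleaved sequence from a
# forced alignment of a transcript to speech tokens. Each word's text
# token precedes its aligned speech tokens, so a decoder-only model
# trained with next-step prediction learns to emit speech for text it
# has already received -- enabling incremental synthesis at inference.

def interleave(alignment):
    """alignment: list of (word, speech_tokens) pairs.
    Returns a flat list of ("text", ...) and ("speech", ...) tokens."""
    seq = []
    for word, speech_tokens in alignment:
        seq.append(("text", word))                    # text token first
        seq.extend(("speech", t) for t in speech_tokens)  # then its audio
    return seq

aligned = [("hello", [101, 102]), ("world", [201, 202, 203])]
print(interleave(aligned))
# → [('text', 'hello'), ('speech', 101), ('speech', 102),
#    ('text', 'world'), ('speech', 201), ('speech', 202), ('speech', 203)]
```

At inference, newly streamed words would be appended as text tokens and the model decoded until it has produced the corresponding speech tokens, rather than waiting for the full utterance.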