TY - JOUR
T1 - Unsupervised Text-to-Speech Synthesis by Unsupervised Automatic Speech Recognition
AU - Ni, Junrui
AU - Wang, Liming
AU - Gao, Heting
AU - Qian, Kaizhi
AU - Zhang, Yang
AU - Chang, Shiyu
AU - Hasegawa-Johnson, Mark
N1 - Funding Information:
This work is supported by the IBM-UIUC Center for Cognitive Computing Systems Research (C3SR). We would like to thank an anonymous reviewer for insights on Section 4.4.
Publisher Copyright:
Copyright © 2022 ISCA.
PY - 2022
Y1 - 2022
AB - An unsupervised text-to-speech synthesis (TTS) system learns to generate speech waveforms corresponding to any written sentence in a language by observing: 1) a collection of untranscribed speech waveforms in that language; 2) a collection of texts written in that language without access to any transcribed speech. Developing such a system can significantly improve the availability of speech technology to languages without a large amount of parallel speech and text data. This paper proposes an unsupervised TTS system based on an alignment module that outputs pseudo-text and another synthesis module that uses pseudo-text for training and real text for inference. Our unsupervised system can achieve comparable performance to the supervised system in seven languages with about 10-20 hours of speech each. A careful study on the effect of text units and vocoders has also been conducted to better understand what factors may affect unsupervised TTS performance. The samples generated by our models can be found at https://cactuswiththoughts.github.io/UnsupTTS-Demo, and our code can be found at https://github.com/lwang114/UnsupTTS.
KW - speech recognition
KW - speech synthesis
KW - unsupervised learning
UR - http://www.scopus.com/inward/record.url?scp=85140065436&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85140065436&partnerID=8YFLogxK
DO - 10.21437/Interspeech.2022-816
M3 - Conference article
AN - SCOPUS:85140065436
SN - 2308-457X
VL - 2022-September
SP - 461
EP - 465
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 23rd Annual Conference of the International Speech Communication Association, INTERSPEECH 2022
Y2 - 18 September 2022 through 22 September 2022
ER -