TY - GEN
T1 - Global Rhythm Style Transfer Without Text Transcriptions
AU - Qian, Kaizhi
AU - Zhang, Yang
AU - Chang, Shiyu
AU - Xiong, Jinjun
AU - Gan, Chuang
AU - Cox, David
AU - Hasegawa-Johnson, Mark
N1 - Publisher Copyright:
Copyright © 2021 by the author(s)
PY - 2021
Y1 - 2021
N2 - Prosody plays an important role in characterizing the style of a speaker or an emotion, but most non-parallel voice or emotion style transfer algorithms do not convert any prosody information. Two major components of prosody are pitch and rhythm. Disentangling the prosody information, particularly the rhythm component, from the speech is challenging because it involves breaking the synchrony between the input speech and the disentangled speech representation. As a result, most existing prosody style transfer algorithms would need to rely on some form of text transcriptions to identify the content information, which confines their application to high-resource languages only. Recently, SPEECHSPLIT (Qian et al., 2020b) has made sizeable progress towards unsupervised prosody style transfer, but it is unable to extract high-level global prosody style in an unsupervised manner. In this paper, we propose AUTOPST, which can disentangle global prosody style from speech without relying on any text transcriptions. AUTOPST is an Autoencoder-based Prosody Style Transfer framework with a thorough rhythm removal module guided by self-expressive representation learning. Experiments on different style transfer tasks show that AUTOPST can effectively convert prosody that correctly reflects the styles of the target domains.
AB - Prosody plays an important role in characterizing the style of a speaker or an emotion, but most non-parallel voice or emotion style transfer algorithms do not convert any prosody information. Two major components of prosody are pitch and rhythm. Disentangling the prosody information, particularly the rhythm component, from the speech is challenging because it involves breaking the synchrony between the input speech and the disentangled speech representation. As a result, most existing prosody style transfer algorithms would need to rely on some form of text transcriptions to identify the content information, which confines their application to high-resource languages only. Recently, SPEECHSPLIT (Qian et al., 2020b) has made sizeable progress towards unsupervised prosody style transfer, but it is unable to extract high-level global prosody style in an unsupervised manner. In this paper, we propose AUTOPST, which can disentangle global prosody style from speech without relying on any text transcriptions. AUTOPST is an Autoencoder-based Prosody Style Transfer framework with a thorough rhythm removal module guided by self-expressive representation learning. Experiments on different style transfer tasks show that AUTOPST can effectively convert prosody that correctly reflects the styles of the target domains.
UR - http://www.scopus.com/inward/record.url?scp=85161331059&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85161331059&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85161331059
T3 - Proceedings of Machine Learning Research
SP - 8650
EP - 8660
BT - Proceedings of the 38th International Conference on Machine Learning, ICML 2021
PB - ML Research Press
T2 - 38th International Conference on Machine Learning, ICML 2021
Y2 - 18 July 2021 through 24 July 2021
ER -