TY - GEN
T1 - Personalized speech enhancement through self-supervised data augmentation and purification
AU - Sivaraman, Aswin
AU - Kim, Sunwoo
AU - Kim, Minje
N1 - Publisher Copyright:
Copyright © 2021 ISCA.
PY - 2021
Y1 - 2021
N2 - Training personalized speech enhancement models is innately a no-shot learning problem due to privacy constraints and limited access to noise-free speech from the target user. If there is an abundance of unlabeled noisy speech from the test-time user, one may train a personalized speech enhancement model using self-supervised learning. One straightforward approach to model personalization is to use the target speaker’s noisy recordings as pseudo-sources. Then, a pseudo denoising model learns to remove injected training noises and recover the pseudo-sources. However, this approach is volatile as it depends on the quality of the pseudo-sources, which may be too noisy. To remedy this, we propose a data purification step that refines the self-supervised approach. We first train an SNR predictor model to estimate the frame-by-frame SNR of the pseudo-sources. Then, we convert the predictor’s estimates into weights that adjust the pseudo-sources’ frame-by-frame contribution towards training the personalized model. We empirically show that the proposed data purification step improves the usability of the speaker-specific noisy data in the context of personalized speech enhancement. Our approach may be seen as privacy-preserving as it does not rely on any clean speech recordings or speaker embeddings.
KW - Model compression
KW - Privacy-preserving machine learning
KW - Self-supervised learning
KW - Speech enhancement
UR - http://www.scopus.com/inward/record.url?scp=85118523993&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85118523993&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2021-1868
DO - 10.21437/Interspeech.2021-1868
M3 - Conference contribution
AN - SCOPUS:85118523993
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 2208
EP - 2212
BT - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
PB - International Speech Communication Association
T2 - 22nd Annual Conference of the International Speech Communication Association, INTERSPEECH 2021
Y2 - 30 August 2021 through 3 September 2021
ER -