TY - GEN
T1 - Test-Time Adaptation Toward Personalized Speech Enhancement
T2 - 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2021
AU - Kim, Sunwoo
AU - Kim, Minje
N1 - This material is based upon work supported by the National Science Foundation under Grant No. 2046963.
PY - 2021
Y1 - 2021
N2 - In realistic speech enhancement settings for end-user devices, we often encounter only a few speakers and noise types that tend to reoccur in the specific acoustic environment. We propose a novel personalized speech enhancement method to adapt a compact denoising model to the test-time specificity. Our goal in this test-time adaptation is to utilize no clean speech target of the test speaker, thus fulfilling the requirement of zero-shot learning. To complement the lack of clean speech, we employ the knowledge distillation framework: we distill the more advanced denoising results from an overly large teacher model and use them as the pseudo target to train the small student model. This zero-shot learning procedure circumvents the process of collecting users' clean speech, a process with which users are reluctant to comply due to privacy concerns and the technical difficulty of recording clean speech. Experiments on various test-time conditions show that the proposed personalization method can significantly improve the compact models' performance at test time. Furthermore, since the personalized models outperform larger non-personalized baseline models, we claim that personalization achieves model compression with no loss of denoising performance.
AB - In realistic speech enhancement settings for end-user devices, we often encounter only a few speakers and noise types that tend to reoccur in the specific acoustic environment. We propose a novel personalized speech enhancement method to adapt a compact denoising model to the test-time specificity. Our goal in this test-time adaptation is to utilize no clean speech target of the test speaker, thus fulfilling the requirement of zero-shot learning. To complement the lack of clean speech, we employ the knowledge distillation framework: we distill the more advanced denoising results from an overly large teacher model and use them as the pseudo target to train the small student model. This zero-shot learning procedure circumvents the process of collecting users' clean speech, a process with which users are reluctant to comply due to privacy concerns and the technical difficulty of recording clean speech. Experiments on various test-time conditions show that the proposed personalization method can significantly improve the compact models' performance at test time. Furthermore, since the personalized models outperform larger non-personalized baseline models, we claim that personalization achieves model compression with no loss of denoising performance.
KW - knowledge distillation
KW - model compression
KW - personalization
KW - speech enhancement
KW - zero-shot learning
UR - http://www.scopus.com/inward/record.url?scp=85119878384&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119878384&partnerID=8YFLogxK
U2 - 10.1109/WASPAA52581.2021.9632771
DO - 10.1109/WASPAA52581.2021.9632771
M3 - Conference contribution
AN - SCOPUS:85119878384
T3 - IEEE Workshop on Applications of Signal Processing to Audio and Acoustics
SP - 176
EP - 180
BT - 2021 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA 2021
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 17 October 2021 through 20 October 2021
ER -