TY - JOUR
T1 - RemixIT
T2 - Continual Self-Training of Speech Enhancement Models via Bootstrapped Remixing
AU - Tzinis, Efthymios
AU - Adi, Yossi
AU - Ithapu, Vamsi K.
AU - Xu, Buye
AU - Smaragdis, Paris
AU - Kumar, Anurag
N1 - Publisher Copyright:
© 2007-2012 IEEE.
PY - 2022/10/1
Y1 - 2022/10/1
N2 - We present RemixIT, a simple yet effective self-supervised method for training speech enhancement models without the need for a single isolated in-domain speech or noise waveform. Our approach overcomes limitations of previous methods, which depend on clean in-domain target signals and are therefore sensitive to any domain mismatch between train and test samples. RemixIT is based on a continual self-training scheme in which a teacher model pre-trained on out-of-domain data infers estimated pseudo-target signals for in-domain mixtures. Then, by permuting the estimated clean and noise signals and remixing them together, we generate a new set of bootstrapped mixtures and corresponding pseudo-targets which are used to train the student network. In turn, the teacher periodically refines its estimates using the updated parameters of the latest student models. Experimental results on multiple speech enhancement datasets and tasks not only show the superiority of our method over prior approaches but also demonstrate that RemixIT can be combined with any separation model and applied to any semi-supervised or unsupervised domain adaptation task. Our analysis, paired with empirical evidence, sheds light on the inner workings of our self-training scheme, wherein the student model keeps improving while observing severely degraded pseudo-targets.
AB - We present RemixIT, a simple yet effective self-supervised method for training speech enhancement models without the need for a single isolated in-domain speech or noise waveform. Our approach overcomes limitations of previous methods, which depend on clean in-domain target signals and are therefore sensitive to any domain mismatch between train and test samples. RemixIT is based on a continual self-training scheme in which a teacher model pre-trained on out-of-domain data infers estimated pseudo-target signals for in-domain mixtures. Then, by permuting the estimated clean and noise signals and remixing them together, we generate a new set of bootstrapped mixtures and corresponding pseudo-targets which are used to train the student network. In turn, the teacher periodically refines its estimates using the updated parameters of the latest student models. Experimental results on multiple speech enhancement datasets and tasks not only show the superiority of our method over prior approaches but also demonstrate that RemixIT can be combined with any separation model and applied to any semi-supervised or unsupervised domain adaptation task. Our analysis, paired with empirical evidence, sheds light on the inner workings of our self-training scheme, wherein the student model keeps improving while observing severely degraded pseudo-targets.
KW - Self-supervised learning
KW - semi-supervised self-training
KW - speech enhancement
KW - zero-shot domain adaptation
UR - http://www.scopus.com/inward/record.url?scp=85137551140&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85137551140&partnerID=8YFLogxK
U2 - 10.1109/JSTSP.2022.3200911
DO - 10.1109/JSTSP.2022.3200911
M3 - Article
AN - SCOPUS:85137551140
SN - 1932-4553
VL - 16
SP - 1329
EP - 1341
JO - IEEE Journal on Selected Topics in Signal Processing
JF - IEEE Journal on Selected Topics in Signal Processing
IS - 6
ER -