Multi-decoder dprnn: Source separation for variable number of speakers

Junzhe Zhu, Raymond A. Yeh, Mark Hasegawa-Johnson

Research output: Contribution to journal, Conference article, peer-reviewed


We propose an end-to-end trainable approach to single-channel speech separation with an unknown number of speakers. Our approach extends the MulCat source separation backbone with additional output heads: a count-head to infer the number of speakers, and decoder-heads for reconstructing the original signals. Beyond the model, we also propose a metric for evaluating source separation with a variable number of speakers. Specifically, we address how to evaluate separation quality when the ground truth has more or fewer speakers than the model predicts. We evaluate our approach on the WSJ0-mix datasets, with mixtures of up to five speakers. We demonstrate that our approach outperforms the state of the art in counting the number of speakers and remains competitive in the quality of the reconstructed signals.
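The abstract describes two mechanisms: routing the count-head's prediction to a matching decoder-head, and scoring separation when the predicted and true speaker counts differ. The sketch below illustrates both under stated assumptions. It does not model the MulCat backbone; decoder outputs are simply passed in as arrays. It assumes SI-SNR (a common separation metric) as the per-source score. The names `select_decoder_output`, `variable_count_si_snr`, and `mismatch_value` are hypothetical. The matching-plus-fixed-penalty rule is an illustrative choice, not the metric defined in the paper.

```python
import itertools
import numpy as np


def si_snr(est, ref, eps=1e-8):
    """Scale-invariant signal-to-noise ratio (dB) of one estimate vs. one reference."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference; the scaled reference is the "target".
    target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    noise = est - target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))


def select_decoder_output(count_logits, decoder_outputs):
    """Hypothetical routing: use the decoder-head matching the predicted count.

    Assumes count_logits[i] scores "i + 1 speakers" and decoder_outputs[i]
    is an array of shape (i + 1, num_samples).
    """
    n_pred = int(np.argmax(count_logits)) + 1
    return n_pred, decoder_outputs[n_pred - 1]


def variable_count_si_snr(estimates, references, mismatch_value=0.0):
    """Hypothetical metric for mismatched speaker counts.

    Finds the one-to-one matching of estimates to references with the highest
    total SI-SNR. Every unmatched reference or leftover estimate contributes
    mismatch_value, and the total is averaged over max(#estimates, #references).
    """
    n_est, n_ref = len(estimates), len(references)
    n_max = max(n_est, n_ref)
    if n_max == 0:
        raise ValueError("need at least one estimate or reference")
    k = min(n_est, n_ref)
    best = -np.inf
    # Brute force over all size-k matchings; cheap for small counts (up to 5 speakers).
    for ref_idx in itertools.combinations(range(n_ref), k):
        for est_idx in itertools.permutations(range(n_est), k):
            score = sum(si_snr(estimates[e], references[r]) for e, r in zip(est_idx, ref_idx))
            best = max(best, score)
    return (best + mismatch_value * (n_max - k)) / n_max


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    refs = rng.standard_normal((3, 1000))
    print(variable_count_si_snr(refs[::-1], refs))  # all 3 speakers recovered, shuffled order
    print(variable_count_si_snr(refs[:2], refs))    # one speaker missed: lower average
```

Both over-counting and under-counting lower the average in this sketch, because the unmatched slots score `mismatch_value`. For larger counts, building a pairwise SI-SNR matrix and calling `scipy.optimize.linear_sum_assignment(..., maximize=True)` finds the same best matching without enumerating permutations.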

Original language: English (US)
Pages (from-to): 3420-3424
Number of pages: 5
Journal: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
State: Published - 2021
Event: 2021 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2021 - Virtual, Toronto, Canada
Duration: Jun 6 2021 to Jun 11 2021


Keywords

  • Source separation

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

