Recent neural network strategies for source separation attempt to model audio signals by processing their waveforms directly. Mean squared error (MSE), which measures the Euclidean distance between the waveforms of the denoised speech and the ground-truth speech, has been a natural cost function for these approaches. However, MSE is not a perceptually motivated measure and may result in large perceptual discrepancies. In this paper, we propose and experiment with new loss functions for end-to-end source separation. These loss functions are motivated by BSS-Eval and perceptual metrics: source-to-distortion ratio (SDR), source-to-interference ratio (SIR), source-to-artifact ratio (SAR), and short-time objective intelligibility (STOI). They can be mixed and matched depending on the requirements of the task. Subjective listening tests reveal that combinations of the proposed cost functions achieve superior separation performance compared to stand-alone MSE and SDR costs.
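To make the idea concrete, a metric such as SDR can be turned into a training objective by negating it, so that maximizing the metric corresponds to minimizing the loss. The sketch below illustrates this for the standard SDR definition; it is a minimal NumPy illustration, and the function and argument names are ours, not from the paper.

```python
import numpy as np

def sdr_loss(estimate, reference, eps=1e-8):
    """Negative source-to-distortion ratio (SDR) in dB.

    SDR = 10 * log10(||s||^2 / ||s - s_hat||^2); minimizing the
    negative SDR is equivalent to maximizing SDR. Illustrative
    sketch only; a real training loss would use a differentiable
    tensor framework rather than NumPy.
    """
    signal_power = np.sum(reference ** 2)
    error_power = np.sum((reference - estimate) ** 2)
    sdr = 10.0 * np.log10((signal_power + eps) / (error_power + eps))
    return -sdr

# A clean estimate scores a much lower loss (higher SDR) than a noisy one.
ref = np.sin(np.linspace(0, 8 * np.pi, 1600))
noisy = ref + 0.1 * np.random.default_rng(0).standard_normal(1600)
```

Other BSS-Eval quantities (SIR, SAR) follow the same pattern with different energy terms in the ratio, which is what allows the resulting losses to be combined with task-specific weights.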