TY - JOUR
T1 - A computational auditory scene analysis system for speech segregation and robust speech recognition
AU - Shao, Yang
AU - Srinivasan, Soundararajan
AU - Jin, Zhaozhang
AU - Wang, DeLiang
N1 - Funding Information:
This research was supported in part by an AFOSR Grant (FA9550-04-1-0117), an AFRL Grant (FA8750-04-1-0093), and an NSF Grant (IIS-0534707). We are grateful to Guoning Hu for discussion and much assistance. We acknowledge the SLATE Lab (E. Fosler-Lussier) for providing computing resources. A preliminary version of this work was presented at Interspeech 2006.
PY - 2010/1
Y1 - 2010/1
N2 - A conventional automatic speech recognizer does not perform well in the presence of multiple sound sources, whereas human listeners are able to segregate and recognize a signal of interest through auditory scene analysis. We present a computational auditory scene analysis system for separating and recognizing target speech in the presence of competing speech or noise. We estimate, in two stages, the ideal binary time-frequency (T-F) mask, which retains the mixture in a local T-F unit if and only if the target is stronger than the interference within the unit. In the first stage, we use harmonicity to segregate the voiced portions of individual sources in each time frame based on multipitch tracking. Additionally, unvoiced portions are segmented based on an onset/offset analysis. In the second stage, speaker characteristics are used to group the T-F units across time frames. The resulting masks are used in an uncertainty decoding framework for automatic speech recognition. We evaluate our system on a speech separation challenge and show that it yields substantial improvement over the baseline performance.
KW - Binary time-frequency mask
KW - Computational auditory scene analysis
KW - Robust speech recognition
KW - Speech segregation
KW - Uncertainty decoding
UR - http://www.scopus.com/inward/record.url?scp=69249159165&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=69249159165&partnerID=8YFLogxK
DO - 10.1016/j.csl.2008.03.004
M3 - Article
AN - SCOPUS:69249159165
SN - 0885-2308
VL - 24
SP - 77
EP - 93
JO - Computer Speech and Language
JF - Computer Speech and Language
IS - 1
ER -
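
As a minimal sketch of the ideal binary time-frequency mask described in the abstract: a T-F unit is retained if and only if the target is stronger than the interference within that unit, i.e. the local SNR exceeds 0 dB. The Python below assumes premixed target and interference energies are available on a T-F grid (e.g., from an auditory filterbank); the function name, array layout, and the explicit 0 dB criterion are illustrative assumptions, not details taken from the paper's implementation.

import numpy as np

def ideal_binary_mask(target_energy, interference_energy, criterion_db=0.0):
    # Illustrative sketch, not the paper's code: keep a unit (mask = 1)
    # iff the target is stronger than the interference within it,
    # i.e. local SNR > criterion (0 dB by default).
    # Inputs: per-unit energies on a (frames x channels) T-F grid.
    eps = np.finfo(float).eps  # guard against division by / log of zero
    local_snr_db = 10.0 * np.log10((target_energy + eps) / (interference_energy + eps))
    return (local_snr_db > criterion_db).astype(float)

# Usage (hypothetical array names): element-wise masking of the mixture's
# T-F representation, all arrays of shape (frames, channels):
# masked_tf = ideal_binary_mask(target_e, interf_e) * mixture_tf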