TY - JOUR
T1 - Model-based sequential organization in cochannel speech
AU - Shao, Yang
AU - Wang, Deliang
N1 - Funding Information:
Manuscript received May 7, 2004; revised December 3, 2004. This research was supported in part by AFOSR under Grant FA9550-04-1-0117, in part by the National Science Foundation under Grant IIS-0081058, and in part by the AFRL under Grant FA8750-04-1-0093. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Timothy J. Hazen.
PY - 2006/1
Y1 - 2006/1
N2 - A human listener has the ability to follow a speaker's voice while others are speaking simultaneously; in particular, the listener can organize the time-frequency energy of the same speaker across time into a single stream. In this paper, we focus on sequential organization in cochannel speech, or mixtures of two voices. We extract minimally corrupted segments, or usable speech, in cochannel speech using a robust multipitch tracking algorithm. The extracted usable speech is shown to capture speaker characteristics and to improve speaker identification (SID) performance across various target-to-interferer ratios. To utilize speaker characteristics for sequential organization, we extend the traditional SID framework to cochannel speech and derive a joint objective for sequential grouping and SID, leading to a search for the optimum hypothesis. Subsequently, we propose a hypothesis pruning algorithm based on speaker models to make the search computationally efficient. Evaluation results show that the proposed system approaches the ceiling SID performance obtained with prior pitch information and yields significant improvement over alternative approaches to sequential organization.
KW - Auditory scene analysis
KW - Cochannel speech
KW - Model-based approach
KW - Sequential organization
KW - Speaker identification (SID)
KW - Usable speech
UR - http://www.scopus.com/inward/record.url?scp=33744996003&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33744996003&partnerID=8YFLogxK
U2 - 10.1109/TSA.2005.854106
DO - 10.1109/TSA.2005.854106
M3 - Article
AN - SCOPUS:33744996003
SN - 1558-7916
VL - 14
SP - 289
EP - 298
JO - IEEE Transactions on Audio, Speech, and Language Processing
JF - IEEE Transactions on Audio, Speech, and Language Processing
IS - 1
ER -