TY - JOUR
T1 - Real-world acoustic event detection
AU - Zhuang, Xiaodan
AU - Zhou, Xi
AU - Hasegawa-Johnson, Mark A.
AU - Huang, Thomas S.
N1 - Funding Information:
This work was funded by NSF Grants 08-03219 and 08-07329. The results and conclusions expressed in this paper are those of the authors, and are not endorsed by the NSF.
PY - 2010/9/1
Y1 - 2010/9/1
AB - Acoustic Event Detection (AED) aims to identify both timestamps and types of events in an audio stream. This becomes very challenging when going beyond restricted highlight events and well controlled recordings. We propose extracting discriminative features for AED using a boosting approach, which outperform classical speech perceptual features, such as Mel-frequency Cepstral Coefficients and log frequency filterbank parameters. We propose leveraging statistical models better fitting the task. First, a tandem connectionist-HMM approach combines the sequence modeling capabilities of the HMM with the high-accuracy context-dependent discriminative capabilities of an artificial neural network trained using the minimum cross entropy criterion. Second, an SVM-GMM-supervector approach uses noise-adaptive kernels better approximating the KL divergence between feature distributions in different audio segments. Experiments on the CLEAR 2007 AED Evaluation set-up demonstrate that the presented features and models lead to over 45% relative performance improvement, and also outperform the best system in the CLEAR AED Evaluation, on detection of twelve general acoustic events in a real seminar environment.
KW - Acoustic Event Detection
KW - Artificial neural network
KW - Feature selection
KW - Gaussian mixture model supervector
KW - Hidden Markov model
KW - Tandem model
UR - http://www.scopus.com/inward/record.url?scp=77955558847&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77955558847&partnerID=8YFLogxK
U2 - 10.1016/j.patrec.2010.02.005
DO - 10.1016/j.patrec.2010.02.005
M3 - Article
AN - SCOPUS:77955558847
SN - 0167-8655
VL - 31
SP - 1543
EP - 1551
JO - Pattern Recognition Letters
JF - Pattern Recognition Letters
IS - 12
ER -