On time-frequency mask estimation for MVDR beamforming with application in robust speech recognition

Xiong Xiao, Shengkui Zhao, Douglas L Jones, Eng Siong Chng, Haizhou Li

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

Acoustic beamforming has played a key role in robust automatic speech recognition (ASR) applications. Accurate estimates of the speech and noise spatial covariance matrices (SCMs) are crucial for successfully applying minimum variance distortionless response (MVDR) beamforming. Reliable estimation of time-frequency (TF) masks can improve the estimation of the SCMs and significantly improve the performance of MVDR beamforming in ASR tasks. In this paper, we focus on TF mask estimation using recurrent neural networks (RNNs). Specifically, our methods include training the RNN to estimate the speech and noise masks independently, training the RNN to minimize the ASR cost function directly, and performing multiple passes to iteratively improve the mask estimation. The proposed methods are evaluated individually and in combination on the CHiME-4 challenge. The results show that the proposed methods improve ASR performance individually and also complement one another. The overall system achieves a word error rate of 8.9% with the 6-microphone configuration, which is much better than the 12.0% achieved with the state-of-the-art MVDR implementation.
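To make the link between TF masks, SCMs, and MVDR weights concrete, the following is a minimal sketch for a single frequency bin and two microphones. It is not the authors' implementation: it assumes one common reference-microphone MVDR formulation, w = (Φ_n⁻¹ Φ_s) e_ref / trace(Φ_n⁻¹ Φ_s), and all function names (`masked_scm`, `mvdr_weights`, `beamform`) are hypothetical. In practice the masks would come from the trained RNN; here they are given as plain numbers.

```python
# Illustrative sketch only: mask-weighted SCMs and MVDR weights for ONE
# frequency bin and TWO microphones, in pure Python. Names are assumptions.

def masked_scm(frames, mask):
    """Mask-weighted spatial covariance: sum_t m_t * y_t y_t^H / sum_t m_t."""
    n = len(frames[0])
    total = sum(mask)
    scm = [[0j] * n for _ in range(n)]
    for y, m in zip(frames, mask):
        for i in range(n):
            for j in range(n):
                scm[i][j] += m * y[i] * y[j].conjugate()
    return [[v / total for v in row] for row in scm]


def inv2(a):
    """Closed-form inverse of a 2x2 complex matrix."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    return [[a[1][1] / det, -a[0][1] / det],
            [-a[1][0] / det, a[0][0] / det]]


def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def mvdr_weights(phi_s, phi_n, ref=0):
    """Assumed formulation: w = (Phi_n^-1 Phi_s) e_ref / trace(Phi_n^-1 Phi_s)."""
    g = matmul(inv2(phi_n), phi_s)
    trace = g[0][0] + g[1][1]
    return [g[i][ref] / trace for i in range(2)]


def beamform(w, y):
    """Beamformer output w^H y for one frame."""
    return sum(wi.conjugate() * yi for wi, yi in zip(w, y))


# Toy data: two speech-dominated frames sharing the steering vector [1, 1j].
speech_frames = [[1 + 0j, 1j], [2 + 0j, 2j]]
speech_mask = [1.0, 1.0]  # stand-in for RNN speech-mask outputs
phi_s = masked_scm(speech_frames, speech_mask)
phi_n = [[1 + 0j, 0.2 + 0j], [0.2 + 0j, 1 + 0j]]  # assumed noise SCM
w = mvdr_weights(phi_s, phi_n)
print(abs(beamform(w, [1 + 0j, 1j])))  # response toward the speech direction
```

The closed-form 2x2 inverse keeps the sketch dependency-free; a 6-microphone setup such as the one evaluated in the paper would need a general complex matrix inverse applied per frequency bin.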

Original language: English (US)
Title of host publication: 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017 - Proceedings
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 3246-3250
Number of pages: 5
ISBN (Electronic): 9781509041176
DOIs: https://doi.org/10.1109/ICASSP.2017.7952756
State: Published - Jun 16 2017
Externally published: Yes
Event: 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017 - New Orleans, United States
Duration: Mar 5 2017 to Mar 9 2017

Publication series

Name: ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISSN (Print): 1520-6149

Other

Other: 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017
Country: United States
City: New Orleans
Period: 3/5/17 to 3/9/17

Fingerprint

Beamforming
Speech recognition
Masks
Recurrent neural networks
Microphones
Covariance matrix
Cost functions
Acoustics

Keywords

  • beamforming
  • long short-term memory
  • neural networks
  • robust speech recognition
  • time-frequency mask

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Electrical and Electronic Engineering

Cite this

Xiao, X., Zhao, S., Jones, D. L., Chng, E. S., & Li, H. (2017). On time-frequency mask estimation for MVDR beamforming with application in robust speech recognition. In 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017 - Proceedings (pp. 3246-3250). [7952756] (ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICASSP.2017.7952756

@inproceedings{ac68261dca54451c96608c71c21775f3,
title = "On time-frequency mask estimation for MVDR beamforming with application in robust speech recognition",
abstract = "Acoustic beamforming has played a key role in robust automatic speech recognition (ASR) applications. Accurate estimates of the speech and noise spatial covariance matrices (SCMs) are crucial for successfully applying minimum variance distortionless response (MVDR) beamforming. Reliable estimation of time-frequency (TF) masks can improve the estimation of the SCMs and significantly improve the performance of MVDR beamforming in ASR tasks. In this paper, we focus on TF mask estimation using recurrent neural networks (RNNs). Specifically, our methods include training the RNN to estimate the speech and noise masks independently, training the RNN to minimize the ASR cost function directly, and performing multiple passes to iteratively improve the mask estimation. The proposed methods are evaluated individually and in combination on the CHiME-4 challenge. The results show that the proposed methods improve ASR performance individually and also complement one another. The overall system achieves a word error rate of 8.9{\%} with the 6-microphone configuration, which is much better than the 12.0{\%} achieved with the state-of-the-art MVDR implementation.",
keywords = "beamforming, long short-term memory, neural networks, robust speech recognition, time-frequency mask",
author = "Xiong Xiao and Shengkui Zhao and Jones, {Douglas L} and Chng, {Eng Siong} and Haizhou Li",
year = "2017",
month = "6",
day = "16",
doi = "10.1109/ICASSP.2017.7952756",
language = "English (US)",
series = "ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "3246--3250",
booktitle = "2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017 - Proceedings",
address = "United States",

}

TY - GEN

T1 - On time-frequency mask estimation for MVDR beamforming with application in robust speech recognition

AU - Xiao, Xiong

AU - Zhao, Shengkui

AU - Jones, Douglas L

AU - Chng, Eng Siong

AU - Li, Haizhou

PY - 2017/6/16

Y1 - 2017/6/16

N2 - Acoustic beamforming has played a key role in robust automatic speech recognition (ASR) applications. Accurate estimates of the speech and noise spatial covariance matrices (SCMs) are crucial for successfully applying minimum variance distortionless response (MVDR) beamforming. Reliable estimation of time-frequency (TF) masks can improve the estimation of the SCMs and significantly improve the performance of MVDR beamforming in ASR tasks. In this paper, we focus on TF mask estimation using recurrent neural networks (RNNs). Specifically, our methods include training the RNN to estimate the speech and noise masks independently, training the RNN to minimize the ASR cost function directly, and performing multiple passes to iteratively improve the mask estimation. The proposed methods are evaluated individually and in combination on the CHiME-4 challenge. The results show that the proposed methods improve ASR performance individually and also complement one another. The overall system achieves a word error rate of 8.9% with the 6-microphone configuration, which is much better than the 12.0% achieved with the state-of-the-art MVDR implementation.

AB - Acoustic beamforming has played a key role in robust automatic speech recognition (ASR) applications. Accurate estimates of the speech and noise spatial covariance matrices (SCMs) are crucial for successfully applying minimum variance distortionless response (MVDR) beamforming. Reliable estimation of time-frequency (TF) masks can improve the estimation of the SCMs and significantly improve the performance of MVDR beamforming in ASR tasks. In this paper, we focus on TF mask estimation using recurrent neural networks (RNNs). Specifically, our methods include training the RNN to estimate the speech and noise masks independently, training the RNN to minimize the ASR cost function directly, and performing multiple passes to iteratively improve the mask estimation. The proposed methods are evaluated individually and in combination on the CHiME-4 challenge. The results show that the proposed methods improve ASR performance individually and also complement one another. The overall system achieves a word error rate of 8.9% with the 6-microphone configuration, which is much better than the 12.0% achieved with the state-of-the-art MVDR implementation.

KW - beamforming

KW - long short-term memory

KW - neural networks

KW - robust speech recognition

KW - time-frequency mask

UR - http://www.scopus.com/inward/record.url?scp=85023767020&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85023767020&partnerID=8YFLogxK

U2 - 10.1109/ICASSP.2017.7952756

DO - 10.1109/ICASSP.2017.7952756

M3 - Conference contribution

AN - SCOPUS:85023767020

T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings

SP - 3246

EP - 3250

BT - 2017 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2017 - Proceedings

PB - Institute of Electrical and Electronics Engineers Inc.

ER -