Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation

Xiong Xiao, Shengkui Zhao, Duc Hoang Ha Nguyen, Xionghu Zhong, Douglas L Jones, Eng Siong Chng, Haizhou Li

Research output: Contribution to journal › Article

Abstract

This paper investigates deep neural network (DNN) based nonlinear feature mapping and statistical linear feature adaptation approaches for reducing reverberation in speech signals. In the nonlinear feature mapping approach, a DNN is trained on a parallel clean/distorted speech corpus to map reverberant and noisy speech coefficients (such as the log magnitude spectrum) to the underlying clean speech coefficients. The constraint imposed by dynamic features (i.e., the time derivatives of the speech coefficients) is used to enhance the smoothness of predicted coefficient trajectories in two ways. One is to obtain the enhanced speech coefficients by a least-squares estimation from the coefficients and dynamic features predicted by the DNN. The other is to incorporate the dynamic feature constraint directly into the DNN training process using a sequential cost function. In the linear feature adaptation approach, a sparse linear transform, called the cross transform, is used to map multiple frames of speech coefficients to a new feature space. The transform is estimated to maximize the likelihood of the transformed coefficients given a model of clean speech coefficients. Unlike the DNN approach, no parallel corpus is used and no assumption on distortion types is made. The two approaches are evaluated on the REVERB Challenge 2014 tasks. Both speech enhancement and automatic speech recognition (ASR) results show that the DNN-based mappings significantly reduce the reverberation in speech and improve both speech quality and ASR performance. For the speech enhancement task, the proposed dynamic feature constraints help to improve the cepstral distance, frequency-weighted segmental signal-to-noise ratio (SNR), and log likelihood ratio metrics while moderately degrading the speech-to-reverberation modulation energy ratio. In addition, the cross transform feature adaptation improves ASR performance significantly for clean-condition trained acoustic models.
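The least-squares smoothing step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the DNN has already predicted, for each utterance, a static coefficient trajectory and its first-order delta (time-derivative) trajectory, and recovers a single smoothed static trajectory that best agrees with both in the least-squares sense. The function names and the simple two-point delta operator are illustrative choices.

```python
import numpy as np

def delta_matrix(T):
    """Build a T x T first-order delta (time-derivative) operator:
    delta[t] = 0.5 * (s[t+1] - s[t-1]), with clamped frame indices
    at the utterance edges."""
    W = np.zeros((T, T))
    for t in range(T):
        lo, hi = max(t - 1, 0), min(t + 1, T - 1)
        W[t, hi] += 0.5
        W[t, lo] -= 0.5
    return W

def smooth_trajectory(static_pred, delta_pred):
    """Least-squares estimate of the clean static trajectory from
    DNN-predicted static and delta coefficients.

    Solves  min_s ||s - static_pred||^2 + ||W s - delta_pred||^2,
    whose closed-form solution is  s = (I + W^T W)^{-1} (static_pred
    + W^T delta_pred).  Inputs are (T, D) arrays: T frames, D
    coefficient dimensions; each dimension is smoothed jointly
    over time.
    """
    T = static_pred.shape[0]
    W = delta_matrix(T)
    A = np.eye(T) + W.T @ W
    b = static_pred + W.T @ delta_pred
    return np.linalg.solve(A, b)
```

As a sanity check on the formulation: a perfectly constant predicted trajectory with all-zero predicted deltas is already self-consistent, so the smoother returns it unchanged. Note that the paper's second variant folds this constraint into the DNN cost function instead of applying it as a post-processing step.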

Original language: English (US)
Article number: 4
Pages (from-to): 1-18
Number of pages: 18
Journal: EURASIP Journal on Advances in Signal Processing
Volume: 2016
Issue number: 1
DOI: 10.1186/s13634-015-0300-4
State: Published - Dec 1 2016
Externally published: Yes

Keywords

  • Beamforming
  • Deep neural networks
  • Dynamic features
  • Feature adaptation
  • Reverberation challenge
  • Robust speech recognition
  • Speech enhancement

ASJC Scopus subject areas

  • Signal Processing
  • Information Systems
  • Hardware and Architecture
  • Electrical and Electronic Engineering

Cite this

Xiao, X., Zhao, S., Ha Nguyen, D. H., Zhong, X., Jones, D. L., Chng, E. S., & Li, H. (2016). Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation. EURASIP Journal on Advances in Signal Processing, 2016(1), Article 4, pp. 1-18.
@article{a49ee61d83fa48f585be1e00191449fe,
title = "Speech dereverberation for enhancement and recognition using dynamic features constrained deep neural networks and feature adaptation",
abstract = "This paper investigates deep neural network (DNN) based nonlinear feature mapping and statistical linear feature adaptation approaches for reducing reverberation in speech signals. In the nonlinear feature mapping approach, a DNN is trained on a parallel clean/distorted speech corpus to map reverberant and noisy speech coefficients (such as the log magnitude spectrum) to the underlying clean speech coefficients. The constraint imposed by dynamic features (i.e., the time derivatives of the speech coefficients) is used to enhance the smoothness of predicted coefficient trajectories in two ways. One is to obtain the enhanced speech coefficients by a least-squares estimation from the coefficients and dynamic features predicted by the DNN. The other is to incorporate the dynamic feature constraint directly into the DNN training process using a sequential cost function. In the linear feature adaptation approach, a sparse linear transform, called the cross transform, is used to map multiple frames of speech coefficients to a new feature space. The transform is estimated to maximize the likelihood of the transformed coefficients given a model of clean speech coefficients. Unlike the DNN approach, no parallel corpus is used and no assumption on distortion types is made. The two approaches are evaluated on the REVERB Challenge 2014 tasks. Both speech enhancement and automatic speech recognition (ASR) results show that the DNN-based mappings significantly reduce the reverberation in speech and improve both speech quality and ASR performance. For the speech enhancement task, the proposed dynamic feature constraints help to improve the cepstral distance, frequency-weighted segmental signal-to-noise ratio (SNR), and log likelihood ratio metrics while moderately degrading the speech-to-reverberation modulation energy ratio. In addition, the cross transform feature adaptation improves ASR performance significantly for clean-condition trained acoustic models.",
keywords = "Beamforming, Deep neural networks, Dynamic features, Feature adaptation, Reverberation challenge, Robust speech recognition, Speech enhancement",
author = "Xiong Xiao and Shengkui Zhao and {Ha Nguyen}, {Duc Hoang} and Xionghu Zhong and Jones, {Douglas L} and Chng, {Eng Siong} and Haizhou Li",
year = "2016",
month = "12",
day = "1",
doi = "10.1186/s13634-015-0300-4",
language = "English (US)",
volume = "2016",
pages = "1--18",
journal = "EURASIP Journal on Advances in Signal Processing",
issn = "1687-6172",
publisher = "Springer Publishing Company",
number = "1",

}
