Speech intelligibility prediction using spectro-temporal modulation analysis

Amin Edraki, Wai Yip Chan, Jesper Jensen, Daniel Fogerty

Research output: Contribution to journal › Article › peer-review

Abstract

Spectro-temporal modulations are believed to mediate the analysis of speech sounds in the human primary auditory cortex. Inspired by humans' robustness in comprehending speech in challenging acoustic environments, we propose an intrusive speech intelligibility prediction (SIP) algorithm, wSTMI, for normal-hearing listeners based on spectro-temporal modulation analysis (STMA) of the clean and degraded speech signals. In the STMA, each of 55 modulation frequency channels contributes an intermediate intelligibility measure. A sparse linear model with parameters optimized using Lasso regression results in combining the intermediate measures of 8 of the most salient channels for SIP. In comparison with a suite of 10 SIP algorithms, wSTMI performs consistently well across 13 datasets, which together cover degradation conditions including modulated noise, noise reduction processing, reverberation, near-end listening enhancement, and speech interruption. We show that the optimized parameters of wSTMI may be interpreted in terms of modulation transfer functions of the human auditory system. Thus, the proposed approach offers evidence affirming previous studies of the perceptual characteristics underlying speech signal intelligibility.
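The channel-selection step described in the abstract can be illustrated with a toy example. The sketch below is not the wSTMI implementation: it uses synthetic random features in place of the per-channel intermediate intelligibility measures, a hypothetical set of 8 "salient" channels out of 55, and a plain coordinate-descent Lasso solver chosen for self-containment. The names `soft_threshold`, `lasso_coordinate_descent`, `SALIENT`, and the value of `alpha` are all assumptions made for illustration.

```python
import random


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=50):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    # Mean squared value of each column, used to scale each update.
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes
            # column j's own current contribution.
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_w = soft_threshold(rho, alpha) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_w
    return w


# Synthetic stand-in data (assumption): 200 stimuli, 55 channel measures,
# of which a hypothetical 8 channels actually drive the target score.
rng = random.Random(0)
N_CHANNELS = 55
SALIENT = [3, 9, 14, 20, 27, 33, 41, 50]  # hypothetical channel indices
X = [[rng.gauss(0.0, 1.0) for _ in range(N_CHANNELS)] for _ in range(200)]
y = [sum(row[j] for j in SALIENT) + rng.gauss(0.0, 0.1) for row in X]

weights = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("channels with non-zero weight:", selected)
```

The L1 penalty drives the weights of uninformative channels to exactly zero, which is how a sparse model can end up combining only a small subset of the available channel measures. Larger values of `alpha` select fewer channels; a sufficiently large value zeroes every weight.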

Original language: English (US)
Article number: 9269417
Pages (from-to): 210-225
Number of pages: 16
Journal: IEEE/ACM Transactions on Audio Speech and Language Processing
Volume: 29
State: Published - 2021

Keywords

  • Spectro-temporal modulation
  • speech intelligibility
  • speech quality model

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering
