TY - JOUR
T1 - Glimpse-based estimation of speech intelligibility from speech-in-noise using artificial neural networks
AU - Tang, Yan
N1 - Funding Information:
This work was partly supported by the UK EPSRC Programme Grant S3A: Future Spatial Audio for an Immersive Listener Experience at Home (EP/L000539/1) and the British Broadcasting Company (BBC) as part of the BBC Audio Research Partnership. The author would also like to thank Trevor Cox for discussion at the early stage of this study, and Lara Harris for recruiting the participants and for organising the listening experiments.
Publisher Copyright:
© 2021 Elsevier Ltd
PY - 2021/9
Y1 - 2021/9
N2 - While human listeners can, to some extent, understand the information conveyed by the speech signal when it is mixed with noise, traditional objective intelligibility measures usually fail to operate without a priori knowledge of the clean speech signal. This hence limits the usability of those measures in situations where the clean speech signal is inaccessible. In this paper a glimpse-based method is extended to make speech intelligibility predictions directly from speech-plus-noise mixtures. Using a neural network, the proposed method estimates the time-frequency regions with a local speech-to-noise ratio above a given threshold -- known as glimpses -- from the mixture signal, instead of separately comparing the speech signal against the noise signal. The number and locations of the glimpses can then be used to produce an intelligibility score. In Experiment I where listener intelligibilities were measured in one stationary and nine fluctuating noise maskers, the predictions produced by the proposed method were highly correlated with the subjective data, with correlation coefficients above 0.90. In Experiment II, with the same neural network trained on normal natural speech as in Experiment I, the proposed method was used to predict the intelligibility of speech signals modified by intelligibility-enhancement algorithms and synthetic speech. The method can still maintain its predictive power by demonstrating a similar performance to its intrusive counterpart with an overall correlation coefficient of 0.81, which is superior to many modern traditional measures evaluated under the same conditions. Therefore, the proposed method can be used to estimate speech intelligibility in place of traditional measures in conditions where their capacity falls short.
AB - While human listeners can, to some extent, understand the information conveyed by the speech signal when it is mixed with noise, traditional objective intelligibility measures usually fail to operate without a priori knowledge of the clean speech signal. This hence limits the usability of those measures in situations where the clean speech signal is inaccessible. In this paper a glimpse-based method is extended to make speech intelligibility predictions directly from speech-plus-noise mixtures. Using a neural network, the proposed method estimates the time-frequency regions with a local speech-to-noise ratio above a given threshold -- known as glimpses -- from the mixture signal, instead of separately comparing the speech signal against the noise signal. The number and locations of the glimpses can then be used to produce an intelligibility score. In Experiment I where listener intelligibilities were measured in one stationary and nine fluctuating noise maskers, the predictions produced by the proposed method were highly correlated with the subjective data, with correlation coefficients above 0.90. In Experiment II, with the same neural network trained on normal natural speech as in Experiment I, the proposed method was used to predict the intelligibility of speech signals modified by intelligibility-enhancement algorithms and synthetic speech. The method can still maintain its predictive power by demonstrating a similar performance to its intrusive counterpart with an overall correlation coefficient of 0.81, which is superior to many modern traditional measures evaluated under the same conditions. Therefore, the proposed method can be used to estimate speech intelligibility in place of traditional measures in conditions where their capacity falls short.
KW - Artificial neural network
KW - Glimpse
KW - Noise
KW - Non-intrusive
KW - Objective intelligibility measure
KW - Speech intelligibility
UR - http://www.scopus.com/inward/record.url?scp=85102975609&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85102975609&partnerID=8YFLogxK
U2 - 10.1016/j.csl.2021.101220
DO - 10.1016/j.csl.2021.101220
M3 - Article
SN - 0885-2308
VL - 69
JO - Computer Speech and Language
JF - Computer Speech and Language
M1 - 101220
ER -