TY - JOUR
T1 - Visualizing phoneme category adaptation in deep neural networks
AU - Scharenborg, Odette
AU - Tiesmeyer, Sebastian
AU - Hasegawa-Johnson, Mark
AU - Dehak, Najim
N1 - Funding Information:
O.S. was partly supported by a Vidi-grant from The Netherlands Organization for Scientific Research (NWO; grant number: 276-89-003). The authors would like to thank Raghavendra Pappagari for writing code to accumulate feature vector activations within phonetic segments.
Publisher Copyright:
© 2018 International Speech Communication Association. All rights reserved.
PY - 2018
Y1 - 2018
N2 - Both human listeners and machines need to adapt their sound categories whenever a new speaker is encountered. This perceptual learning is driven by lexical information. The aim of this paper is two-fold: to investigate whether a deep neural network (DNN)-based ASR system can adapt to only a few examples of ambiguous speech, as humans have been found to do, and to investigate a DNN's ability to serve as a model of human perceptual learning. Crucially, we do so by looking at intermediate levels of phoneme category adaptation rather than at the output level. We visualize the activations in the hidden layers of the DNN during perceptual learning. The results show that, similar to humans, DNN systems learn speaker-adapted phone category boundaries from a few labeled examples. The DNN adapts its category boundaries not only by adapting the weights of the output layer, but also by adapting the implicit feature maps computed by the hidden layers, suggesting that human perceptual learning might involve a similar nonlinear distortion of a perceptual space intermediate between the acoustic input and the phonological categories. Comparisons between DNNs and humans can thus provide valuable insights into the way humans process speech and can help improve ASR technology.
AB - Both human listeners and machines need to adapt their sound categories whenever a new speaker is encountered. This perceptual learning is driven by lexical information. The aim of this paper is two-fold: to investigate whether a deep neural network (DNN)-based ASR system can adapt to only a few examples of ambiguous speech, as humans have been found to do, and to investigate a DNN's ability to serve as a model of human perceptual learning. Crucially, we do so by looking at intermediate levels of phoneme category adaptation rather than at the output level. We visualize the activations in the hidden layers of the DNN during perceptual learning. The results show that, similar to humans, DNN systems learn speaker-adapted phone category boundaries from a few labeled examples. The DNN adapts its category boundaries not only by adapting the weights of the output layer, but also by adapting the implicit feature maps computed by the hidden layers, suggesting that human perceptual learning might involve a similar nonlinear distortion of a perceptual space intermediate between the acoustic input and the phonological categories. Comparisons between DNNs and humans can thus provide valuable insights into the way humans process speech and can help improve ASR technology.
KW - Deep neural networks
KW - Human perceptual learning
KW - Phoneme category adaptation
KW - Visualization
UR - http://www.scopus.com/inward/record.url?scp=85054976311&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85054976311&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2018-1707
DO - 10.21437/Interspeech.2018-1707
M3 - Conference article
AN - SCOPUS:85054976311
SN - 2308-457X
VL - 2018-September
SP - 1482
EP - 1486
JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
T2 - 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018
Y2 - 2 September 2018 through 6 September 2018
ER -