Visualizing phoneme category adaptation in deep neural networks

Odette Scharenborg, Sebastian Tiesmeyer, Mark Allan Hasegawa-Johnson, Najim Dehak

Research output: Contribution to journal › Conference article

Abstract

Both human listeners and machines need to adapt their sound categories whenever a new speaker is encountered. This perceptual learning is driven by lexical information. The aim of this paper is two-fold: to investigate whether a deep neural network-based (DNN) ASR system can adapt to only a few examples of ambiguous speech, as humans have been found to do, and to investigate a DNN's ability to serve as a model of human perceptual learning. Crucially, we do so by looking at intermediate levels of phoneme category adaptation rather than only at the output level: we visualize the activations in the hidden layers of the DNN during perceptual learning. The results show that, like humans, DNN systems learn speaker-adapted phone category boundaries from only a few labeled examples. The DNN adapts its category boundaries not only by adapting the weights of the output layer, but also by adapting the implicit feature maps computed by the hidden layers, suggesting that human perceptual learning might involve a similar nonlinear distortion of a perceptual space intermediate between the acoustic input and the phonological categories. Comparisons between DNNs and humans can thus provide valuable insights into the way humans process speech and help improve ASR technology.
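
To make the adaptation-and-visualization procedure described in the abstract concrete, the sketch below shows one way such an experiment could be set up. It is a minimal illustration only: the network shape, feature dimensionality, phoneme pair, optimizer settings, and the use of PCA for the 2-D projection are assumptions made for this example, not details taken from the paper.

# Illustrative sketch: a toy feed-forward phoneme classifier, a few-example
# "perceptual learning" update, and a 2-D view of its hidden-layer activations.
# Architecture, feature dimension, and the /l/-/r/ style two-way phoneme set
# are assumptions for this example only.

import torch
import torch.nn as nn
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

N_FEATS, N_PHONES = 40, 2    # e.g. 40 filterbank features; two competing phoneme categories (assumed)

class PhonemeDNN(nn.Module):
    def __init__(self, n_feats=N_FEATS, n_hidden=64, n_phones=N_PHONES):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(n_feats, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        self.out = nn.Linear(n_hidden, n_phones)

    def forward(self, x, return_hidden=False):
        h = self.hidden(x)            # hidden-layer activations to be visualized
        logits = self.out(h)
        return (logits, h) if return_hidden else logits

model = PhonemeDNN()

# A handful of "ambiguous" tokens with lexically supplied labels
# (random stand-ins here, in place of real acoustic features).
x_ambig = torch.randn(8, N_FEATS)
y_ambig = torch.randint(0, N_PHONES, (8,))

# Few-example adaptation: fine-tune the whole network for a small number of
# steps, so that both the output weights and the hidden feature maps can shift,
# which is the contrast the abstract draws with output-layer-only adaptation.
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
for _ in range(20):
    opt.zero_grad()
    loss = loss_fn(model(x_ambig), y_ambig)
    loss.backward()
    opt.step()

# Visualize hidden activations for a set of test tokens by projecting them to
# 2-D with PCA (the choice of projection method is an assumption of this sketch).
x_test = torch.randn(200, N_FEATS)
with torch.no_grad():
    _, h = model(x_test, return_hidden=True)
coords = PCA(n_components=2).fit_transform(h.numpy())
plt.scatter(coords[:, 0], coords[:, 1], s=8)
plt.title("Hidden-layer activations after adaptation (illustrative)")
plt.show()

In this sketch all parameters are updated; freezing everything except model.out would mimic adaptation of the output layer alone, and comparing the two projections before and after adaptation is one simple way to see whether the hidden feature maps themselves have been distorted.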

Original language: English (US)
Pages (from-to): 1482-1486
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2018-September
DOIs: 10.21437/Interspeech.2018-1707
State: Published - Jan 1 2018
Event: 19th Annual Conference of the International Speech Communication Association, INTERSPEECH 2018 - Hyderabad, India
Duration: Sep 2 2018 - Sep 6 2018

Keywords

  • Deep neural networks
  • Human perceptual learning
  • Phoneme category adaptation
  • Visualization

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Cite this

Visualizing phoneme category adaptation in deep neural networks. / Scharenborg, Odette; Tiesmeyer, Sebastian; Hasegawa-Johnson, Mark Allan; Dehak, Najim.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2018-September, 01.01.2018, p. 1482-1486.

Research output: Contribution to journal › Conference article

@article{f789017037554f3a89911ef9258916c8,
title = "Visualizing phoneme category adaptation in deep neural networks",
abstract = "Both human listeners and machines need to adapt their sound categories whenever a new speaker is encountered. This perceptual learning is driven by lexical information. The aim of this paper is two-fold: investigate whether a deep neural network-based (DNN) ASR system can adapt to only a few examples of ambiguous speech as humans have been found to do; investigate a DNN's ability to serve as a model of human perceptual learning. Crucially, we do so by looking at intermediate levels of phoneme category adaptation rather than at the output level. We visualize the activations in the hidden layers of the DNN during perceptual learning. The results show that, similar to humans, DNN systems learn speaker-adapted phone category boundaries from a few labeled examples. The DNN adapts its category boundaries not only by adapting the weights of the output layer, but also by adapting the implicit feature maps computed by the hidden layers, suggesting the possibility that human perceptual learning might involve a similar nonlinear distortion of a perceptual space that is intermediate between the acoustic input and the phonological categories. Comparisons between DNNs and humans can thus provide valuable insights into the way humans process speech and improve ASR technology.",
keywords = "Deep neural networks, Human perceptual learning, Phoneme category adaptation, Visualization",
author = "Scharenborg, Odette and Tiesmeyer, Sebastian and Hasegawa-Johnson, {Mark Allan} and Dehak, Najim",
year = "2018",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2018-1707",
language = "English (US)",
volume = "2018-September",
pages = "1482--1486",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",
}

TY - JOUR

T1 - Visualizing phoneme category adaptation in deep neural networks

AU - Scharenborg, Odette

AU - Tiesmeyer, Sebastian

AU - Hasegawa-Johnson, Mark Allan

AU - Dehak, Najim

PY - 2018/1/1

Y1 - 2018/1/1

N2 - Both human listeners and machines need to adapt their sound categories whenever a new speaker is encountered. This perceptual learning is driven by lexical information. The aim of this paper is two-fold: investigate whether a deep neural network-based (DNN) ASR system can adapt to only a few examples of ambiguous speech as humans have been found to do; investigate a DNN's ability to serve as a model of human perceptual learning. Crucially, we do so by looking at intermediate levels of phoneme category adaptation rather than at the output level. We visualize the activations in the hidden layers of the DNN during perceptual learning. The results show that, similar to humans, DNN systems learn speaker-adapted phone category boundaries from a few labeled examples. The DNN adapts its category boundaries not only by adapting the weights of the output layer, but also by adapting the implicit feature maps computed by the hidden layers, suggesting the possibility that human perceptual learning might involve a similar nonlinear distortion of a perceptual space that is intermediate between the acoustic input and the phonological categories. Comparisons between DNNs and humans can thus provide valuable insights into the way humans process speech and improve ASR technology.

KW - Deep neural networks

KW - Human perceptual learning

KW - Phoneme category adaptation

KW - Visualization

UR - http://www.scopus.com/inward/record.url?scp=85054976311&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85054976311&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2018-1707

DO - 10.21437/Interspeech.2018-1707

M3 - Conference article

AN - SCOPUS:85054976311

VL - 2018-September

SP - 1482

EP - 1486

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SN - 2308-457X

ER -