Learning speaker aware offsets for speaker adaptation of neural networks

Leda Sarı, Samuel Thomas, Mark Hasegawa-Johnson

Research output: Contribution to journal › Conference article

Abstract

In this work, we present an unsupervised long short-term memory (LSTM) layer normalization technique that we call adaptation by speaker-aware offsets (ASAO). These offsets are learned using an auxiliary network attached to the main senone classifier. The auxiliary network takes the main network's LSTM activations as input and tries to reconstruct speaker-, (speaker, phone)-, and (speaker, senone)-level averages of the activations by minimizing the mean-squared error. Once the auxiliary network has been jointly trained with the main network, no additional information about the test data is needed at test time, as the network generates the offsets itself. Unlike many speaker adaptation studies, which adapt only fully-connected layers, our method is applicable to LSTM layers as well as fully-connected layers. In our experiments, we investigate the effect of ASAO on LSTM layers at different depths. We also show its performance when the inputs have already been speaker-adapted by feature-space maximum likelihood linear regression (fMLLR). In addition, we compare ASAO with a speaker-adversarial training framework. ASAO achieves higher senone classification accuracy and a lower word error rate (WER) than both the unadapted models and the adversarial model on the HUB4 dataset, with an absolute WER reduction of up to 2%.
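
The abstract describes the mechanism only at a high level. As a rough illustration, the following PyTorch sketch shows one plausible reading of ASAO: an auxiliary network maps the main LSTM's activations to a speaker-aware offset, the offset-adjusted activations feed the senone classifier, and the auxiliary output is trained with a mean-squared-error loss against precomputed speaker-level activation averages. The class and function names, layer sizes, the choice to subtract the predicted average (normalization-style rather than additive), and the loss weight alpha are all illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ASAOSketch(nn.Module):
    # Main LSTM senone classifier with an auxiliary offset network
    # attached to its activations (all dimensions are assumptions).
    def __init__(self, feat_dim=40, hidden_dim=512, num_senones=3000):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        # Auxiliary network: LSTM activations -> predicted speaker-level
        # average of the activations, used as a speaker-aware offset.
        self.offset_net = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim))
        self.classifier = nn.Linear(hidden_dim, num_senones)

    def forward(self, feats):
        h, _ = self.lstm(feats)               # (batch, time, hidden)
        offset = self.offset_net(h)           # generated from activations only
        logits = self.classifier(h - offset)  # subtraction is an assumption
        return logits, offset

def joint_loss(logits, offset, senone_targets, speaker_avg, alpha=0.1):
    # Senone cross-entropy plus MSE reconstruction of the speaker-level
    # activation average; alpha is an assumed interpolation weight.
    ce = F.cross_entropy(logits.flatten(0, 1), senone_targets.flatten())
    mse = F.mse_loss(offset, speaker_avg.expand_as(offset))
    return ce + alpha * mse

Here speaker_avg would be a (batch, 1, hidden) tensor of per-speaker mean activations precomputed over the training data; the (speaker, phone)- and (speaker, senone)-level averages mentioned in the abstract would add analogous MSE terms. Consistent with the unsupervised use described above, test time needs only forward(), which derives the offset from the activations alone.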

Keywords

  • Neural networks
  • Speaker adaptation
  • Speech recognition

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Cite this

Learning speaker aware offsets for speaker adaptation of neural networks. / Sarı, Leda; Thomas, Samuel; Hasegawa-Johnson, Mark.

In: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Vol. 2019-September, 01.01.2019, p. 769-773.

Research output: Contribution to journal › Conference article

@article{be0caac0d5a3426099ca3585b07e6281,
title = "Learning speaker aware offsets for speaker adaptation of neural networks",
abstract = "In this work, we present an unsupervised long short-term memory (LSTM) layer normalization technique that we call adaptation by speaker aware offsets (ASAO). These offsets are learned using an auxiliary network attached to the main senone classifier. The auxiliary network takes main network LSTM activations as input and tries to reconstruct speaker, (speaker,phone) and (speaker,senone)-level averages of the activations by minimizing the mean-squared error. Once the auxiliary network is jointly trained with the main network, during test time we do not need additional information for the test data as the network will generate the offset itself. Unlike many speaker adaptation studies which only adapt fully connected layers, our method is applicable to LSTM layers in addition to fully-connected layers. In our experiments, we investigate the effect of ASAO of LSTM layers at different depths. We also show its performance when the inputs are already speaker adapted by feature space maximum likelihood linear regression (fMLLR). In addition, we compare ASAO with a speaker adversarial training framework. ASAO achieves higher senone classification accuracy and lower word error rate (WER) than both the unadapted models and the adversarial model on the HUB4 dataset, with an absolute WER reduction of up to 2{\%}.",
keywords = "Neural networks, Speaker adaptation, Speech recognition",
author = "Leda Sarı and Samuel Thomas and Mark Hasegawa-Johnson",
year = "2019",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2019-1788",
language = "English (US)",
volume = "2019-September",
pages = "769--773",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",

}
