Auxiliary Networks for Joint Speaker Adaptation and Speaker Change Detection

Leda Sari, Mark Hasegawa-Johnson, Samuel Thomas

Research output: Contribution to journal › Article › peer-review


Speaker adaptation and speaker change detection have both been studied extensively to improve automatic speech recognition (ASR). In many cases, these two problems are investigated separately: speaker change detection is implemented first to obtain single-speaker regions, and speaker adaptation is then performed using the derived speaker segments for improved ASR. However, in an online setting, we want to achieve both goals in a single pass. In this study, we propose a neural network architecture that learns a speaker embedding from which it can perform both speaker adaptation for ASR and speaker change detection. The proposed speaker embedding is computed using self-attention based on an auxiliary network attached to a main ASR network. ASR adaptation is then performed by subtracting, from the main network activations, a segment-dependent affine transformation of the learned speaker embedding. In experiments on a broadcast news dataset and the Switchboard conversational dataset, we test our system on utterances that contain a change point and show that the proposed method performs significantly better than the unadapted main network (10-14% relative reduction in word error rate (WER)). The proposed architecture also outperforms three different speaker segmentation methods followed by ASR (around 10% relative reduction in WER).
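The adaptation step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The abstract gives no layer sizes, attention form, or training details, so everything concrete here is an assumption made for illustration: the function names, the dimensions, a single learned attention vector `w_att` standing in for the self-attention, and one affine map (`A`, `b`) for a single segment.

```python
import numpy as np

def softmax(x, axis=0):
    # Numerically stable softmax along the given axis.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attentive_embedding(aux_feats, w_att):
    """Pool frame-level auxiliary features (T, D) into one embedding (D,).

    Assumption: attention scores come from a single vector w_att; the
    paper's actual self-attention may be parameterized differently.
    """
    scores = aux_feats @ w_att         # (T,) one score per frame
    weights = softmax(scores, axis=0)  # (T,) non-negative, sums to 1
    return weights @ aux_feats         # (D,) attention-weighted average

def adapt(main_acts, embedding, A, b):
    """Subtract an affine transform of the embedding from activations (T, H)."""
    shift = A @ embedding + b          # (H,) affine transform for this segment
    return main_acts - shift           # broadcast over all T frames

# Hypothetical sizes and random parameters, for illustration only.
rng = np.random.default_rng(0)
T, D, H = 50, 8, 16
aux_feats = rng.standard_normal((T, D))   # auxiliary-network outputs
main_acts = rng.standard_normal((T, H))   # main ASR network activations
w_att = rng.standard_normal(D)
A = rng.standard_normal((H, D))
b = rng.standard_normal(H)

emb = attentive_embedding(aux_feats, w_att)
adapted = adapt(main_acts, emb, A, b)
```

In this sketch the same shift is removed from every frame of the segment, which is one plausible reading of "segment-dependent"; how the embedding is also used for speaker change detection is not shown.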

Original language: English (US)
Article number: 9271936
Pages (from-to): 324-333
Number of pages: 10
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
State: Published - 2021


Keywords

  • Speaker adaptation
  • automatic speech recognition
  • speaker change detection
  • speaker segmentation

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering

