Abstract
Speaker adaptation and speaker change detection have both been studied extensively to improve automatic speech recognition (ASR). In many cases, these two problems are investigated separately: speaker change detection is performed first to obtain single-speaker regions, and speaker adaptation is then applied to the resulting speaker segments for improved ASR. In an online setting, however, both goals must be achieved in a single pass. In this study, we propose a neural network architecture that learns a speaker embedding from which it can perform both speaker adaptation for ASR and speaker change detection. The proposed speaker embedding is computed using self-attention over an auxiliary network attached to the main ASR network. ASR adaptation is then performed by subtracting a segment-dependent affine transformation of the learned speaker embedding from the main network activations. In experiments on a broadcast news dataset and the Switchboard conversational dataset, we test our system on utterances that contain a speaker change point and show that the proposed method performs significantly better than the unadapted main network (10-14% relative reduction in word error rate (WER)). The proposed architecture also outperforms three different speaker segmentation methods followed by ASR (around 10% relative reduction in WER).
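To make the mechanism concrete, the following is a minimal PyTorch sketch of the two ingredients the abstract describes: self-attentive pooling of auxiliary-network frame features into a speaker embedding, and adaptation by subtracting an affine transformation of that embedding from the main network's activations. The module names, dimensions, and single-head attention scorer are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch of the adaptation mechanism; names, dimensions,
# and the single-head attention pooling are assumptions, not the
# authors' exact design.
import torch
import torch.nn as nn

class SelfAttentivePooling(nn.Module):
    """Pool frame-level auxiliary-network features into one speaker embedding."""
    def __init__(self, feat_dim: int, emb_dim: int):
        super().__init__()
        self.score = nn.Linear(feat_dim, 1)       # per-frame attention scorer
        self.proj = nn.Linear(feat_dim, emb_dim)  # projection to embedding space

    def forward(self, aux_feats: torch.Tensor) -> torch.Tensor:
        # aux_feats: (batch, time, feat_dim)
        weights = torch.softmax(self.score(aux_feats), dim=1)  # (B, T, 1), over time
        pooled = (weights * aux_feats).sum(dim=1)              # (B, feat_dim)
        return self.proj(pooled)                               # (B, emb_dim)

class SpeakerAdaptedLayer(nn.Module):
    """Adapt main-network activations by subtracting an affine
    transformation of the speaker embedding."""
    def __init__(self, hid_dim: int, emb_dim: int):
        super().__init__()
        self.affine = nn.Linear(emb_dim, hid_dim)  # affine map of the embedding

    def forward(self, hidden: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, hid_dim); spk_emb: (B, emb_dim)
        shift = self.affine(spk_emb).unsqueeze(1)  # (B, 1, hid_dim)
        return hidden - shift                      # speaker-adapted activations

# Toy usage with random features (dimensions are arbitrary).
aux = torch.randn(2, 100, 40)    # auxiliary-network frame features
hid = torch.randn(2, 100, 256)   # main ASR network activations
pool = SelfAttentivePooling(feat_dim=40, emb_dim=64)
adapt = SpeakerAdaptedLayer(hid_dim=256, emb_dim=64)
adapted = adapt(hid, pool(aux))  # (2, 100, 256)
```

In the paper's setup the same learned embedding also drives speaker change detection; the sketch above covers only the adaptation path.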
Original language | English (US)
---|---
Article number | 9271936
Pages (from-to) | 324-333
Number of pages | 10
Journal | IEEE/ACM Transactions on Audio, Speech, and Language Processing
Volume | 29
DOIs |
State | Published - 2021
Keywords
- Speaker adaptation
- automatic speech recognition
- speaker change detection
- speaker segmentation
ASJC Scopus subject areas
- Computer Science (miscellaneous)
- Acoustics and Ultrasonics
- Computational Mathematics
- Electrical and Electronic Engineering