TY - JOUR
T1 - Auxiliary Networks for Joint Speaker Adaptation and Speaker Change Detection
AU - Sari, Leda
AU - Hasegawa-Johnson, Mark
AU - Thomas, Samuel
N1 - Funding Information:
Manuscript received March 21, 2020; revised August 18, 2020; accepted November 11, 2020. Date of publication November 25, 2020; date of current version December 14, 2020. The work of Leda Sari was supported by NSF under Grant IIS 19-10319. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Lei Xie. (Corresponding author: Leda Sari.) Leda Sari and Mark Hasegawa-Johnson are with the Department of Electrical and Computer Engineering and the Beckman Institute, University of Illinois at Urbana-Champaign, Champaign, IL 61801 USA (e-mail: lsari2@illinois.edu; jhasegaw@illinois.edu).
Publisher Copyright:
© 2014 IEEE.
PY - 2021
Y1 - 2021
N2 - Speaker adaptation and speaker change detection have both been studied extensively to improve automatic speech recognition (ASR). In many cases, these two problems are investigated separately: speaker change detection is implemented first to obtain single-speaker regions, and speaker adaptation is then performed using the derived speaker segments for improved ASR. However, in an online setting, we want to achieve both goals in a single pass. In this study, we propose a neural network architecture that learns a speaker embedding from which it can perform both speaker adaptation for ASR and speaker change detection. The proposed speaker embedding is computed using self-attention based on an auxiliary network attached to a main ASR network. ASR adaptation is then performed by subtracting, from the main network activations, a segment-dependent affine transformation of the learned speaker embedding. In experiments on a broadcast news dataset and the Switchboard conversational dataset, we test our system on utterances containing a change point and show that the proposed method achieves significantly better performance than the unadapted main network (10-14% relative reduction in word error rate (WER)). The proposed architecture also outperforms three different speaker segmentation methods followed by ASR (around 10% relative reduction in WER).
AB - Speaker adaptation and speaker change detection have both been studied extensively to improve automatic speech recognition (ASR). In many cases, these two problems are investigated separately: speaker change detection is implemented first to obtain single-speaker regions, and speaker adaptation is then performed using the derived speaker segments for improved ASR. However, in an online setting, we want to achieve both goals in a single pass. In this study, we propose a neural network architecture that learns a speaker embedding from which it can perform both speaker adaptation for ASR and speaker change detection. The proposed speaker embedding is computed using self-attention based on an auxiliary network attached to a main ASR network. ASR adaptation is then performed by subtracting, from the main network activations, a segment-dependent affine transformation of the learned speaker embedding. In experiments on a broadcast news dataset and the Switchboard conversational dataset, we test our system on utterances containing a change point and show that the proposed method achieves significantly better performance than the unadapted main network (10-14% relative reduction in word error rate (WER)). The proposed architecture also outperforms three different speaker segmentation methods followed by ASR (around 10% relative reduction in WER).
KW - Speaker adaptation
KW - automatic speech recognition
KW - speaker change detection
KW - speaker segmentation
UR - http://www.scopus.com/inward/record.url?scp=85097161945&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85097161945&partnerID=8YFLogxK
U2 - 10.1109/TASLP.2020.3040626
DO - 10.1109/TASLP.2020.3040626
M3 - Article
AN - SCOPUS:85097161945
SN - 2329-9290
VL - 29
SP - 324
EP - 333
JO - IEEE/ACM Transactions on Audio, Speech, and Language Processing
JF - IEEE/ACM Transactions on Audio, Speech, and Language Processing
M1 - 9271936
ER -