TY - GEN
T1 - Training spoken language understanding systems with non-parallel speech and text
AU - Sari, Leda
AU - Thomas, Samuel
AU - Hasegawa-Johnson, Mark
N1 - Publisher Copyright:
© 2020 Institute of Electrical and Electronics Engineers Inc. All rights reserved.
PY - 2020/5
Y1 - 2020/5
AB - End-to-end spoken language understanding (SLU) systems are typically trained on large amounts of data. In many practical scenarios, however, labeled speech is far scarcer than text. In this study, we investigate the use of non-parallel speech and text to improve the performance of dialog act recognition as an example SLU task. We propose a multiview architecture that can handle each modality separately. To train effectively on such data, the model uses a shared classifier to enforce similarity between its internal speech and text encodings. On the Switchboard Dialog Act corpus, we show that pretraining the classifier on large amounts of text helps the model learn better speech encodings, yielding up to 40% relative improvement in classification accuracy. We also show that when speech embeddings from an automatic speech recognition (ASR) system are used in this framework, the speech-only accuracy exceeds that of tests based on ASR transcripts by up to 15% relative and approaches the performance obtained with true transcripts.
KW - Dialog act recognition
KW - Multiview training
KW - Non-parallel data
KW - Spoken language understanding
UR - http://www.scopus.com/inward/record.url?scp=85091318275&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85091318275&partnerID=8YFLogxK
U2 - 10.1109/ICASSP40776.2020.9054664
DO - 10.1109/ICASSP40776.2020.9054664
M3 - Conference contribution
AN - SCOPUS:85091318275
T3 - ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
SP - 8109
EP - 8113
BT - 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020 - Proceedings
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, ICASSP 2020
Y2 - 4 May 2020 through 8 May 2020
ER -