TY - GEN
T1 - A DNN-HMM-DNN hybrid model for discovering word-like units from spoken captions and image regions
AU - Wang, Liming
AU - Hasegawa-Johnson, Mark
N1 - Publisher Copyright:
Copyright © 2020 ISCA
PY - 2020
Y1 - 2020
N2 - Discovering word-like units without textual transcriptions is an important step in low-resource speech technology. In this work, we demonstrate a model inspired by statistical machine translation and hidden Markov model/deep neural network (HMMDNN) hybrid systems. Our learning algorithm is capable of discovering the visual and acoustic correlates of K distinct words in an unknown language by simultaneously learning the mapping from image regions to concepts (the first DNN), the mapping from acoustic feature vectors to phones (the second DNN), and the optimum alignment between the two (the HMM). In the simulated low-resource setting using MSCOCO and SpeechCOCO datasets, our model achieves 62.4 % alignment accuracy and outperforms the audio-only segmental embedded GMM approach on standard word discovery evaluation metrics.
AB - Discovering word-like units without textual transcriptions is an important step in low-resource speech technology. In this work, we demonstrate a model inspired by statistical machine translation and hidden Markov model/deep neural network (HMMDNN) hybrid systems. Our learning algorithm is capable of discovering the visual and acoustic correlates of K distinct words in an unknown language by simultaneously learning the mapping from image regions to concepts (the first DNN), the mapping from acoustic feature vectors to phones (the second DNN), and the optimum alignment between the two (the HMM). In the simulated low-resource setting using MSCOCO and SpeechCOCO datasets, our model achieves 62.4 % alignment accuracy and outperforms the audio-only segmental embedded GMM approach on standard word discovery evaluation metrics.
KW - Language acquisition
KW - Machine translation
KW - Multimodal learning
KW - Unsupervised spoken word discovery
UR - http://www.scopus.com/inward/record.url?scp=85098165215&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85098165215&partnerID=8YFLogxK
U2 - 10.21437/Interspeech.2020-1148
DO - 10.21437/Interspeech.2020-1148
M3 - Conference contribution
AN - SCOPUS:85098165215
SN - 9781713820697
T3 - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
SP - 1456
EP - 1460
BT - Interspeech 2020
PB - International Speech Communication Association
T2 - 21st Annual Conference of the International Speech Communication Association, INTERSPEECH 2020
Y2 - 25 October 2020 through 29 October 2020
ER -