Multimodal Word Discovery and Retrieval with Spoken Descriptions and Visual Concepts

Research output: Contribution to journal › Article › peer-review


In the absence of dictionaries, translators, or grammars, it is still possible to learn some of the words of a new language by listening to spoken descriptions of images. If several images, each containing a particular visually salient object, each co-occur with a particular sequence of speech sounds, we can infer that those speech sounds are a word whose definition is the visible object. A multimodal word discovery system accepts, as input, a database of spoken descriptions of images (or a set of corresponding phone transcriptions) and learns a mapping from waveform segments (or phone strings) to their associated image concepts. In this article, four multimodal word discovery systems are demonstrated: three models based on statistical machine translation (SMT) and one based on neural machine translation (NMT). The systems are trained on phonetic transcriptions, MFCC features, and multilingual bottleneck (MBN) features. At the phone level, the SMT approach outperforms the NMT model, achieving a 61.6% F1 score on the word discovery task on Flickr30k. At the audio level, we compare our models with the existing ES-KMeans algorithm for word discovery and present some of the challenges in multimodal spoken word discovery.
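The co-occurrence idea in the abstract can be illustrated with a toy alignment model. The sketch below is not the authors' system: it is a minimal IBM Model 1 style EM loop, a standard SMT alignment technique, run on invented data. The phone symbols, concept labels, and the names `train_ibm1`, `align`, and `PAIRS` are all assumptions made for illustration, and every concept is assumed to appear with at least one phone.

```python
from collections import defaultdict

# Hypothetical toy data: (image concepts, phone transcription of the caption).
PAIRS = [
    (["dog"], ["d", "ao", "g"]),
    (["dog", "ball"], ["d", "ao", "g", "b", "ao", "l"]),
    (["ball"], ["b", "ao", "l"]),
]


def train_ibm1(pairs, iterations=10):
    """Estimate t[concept][phone] = P(phone | concept) with IBM Model 1 EM."""
    phones = {p for _, ps in pairs for p in ps}
    concepts = {c for cs, _ in pairs for c in cs}
    # Start from a uniform distribution over phones for every concept.
    t = {c: {p: 1.0 / len(phones) for p in phones} for c in concepts}
    for _ in range(iterations):
        counts = {c: defaultdict(float) for c in concepts}
        totals = defaultdict(float)
        # E-step: share each phone among the concepts of its own image.
        for cs, ps in pairs:
            for p in ps:
                norm = sum(t[c][p] for c in cs)
                for c in cs:
                    frac = t[c][p] / norm
                    counts[c][p] += frac
                    totals[c] += frac
        # M-step: renormalize the expected counts per concept.
        t = {c: {p: counts[c][p] / totals[c] for p in phones} for c in concepts}
    return t


def align(t, concepts, phones):
    """Assign each phone to the concept most likely to have generated it."""
    return [max(concepts, key=lambda c: t[c][p]) for p in phones]


t = train_ibm1(PAIRS)
print(align(t, ["dog", "ball"], ["d", "g", "b", "l"]))
```

In this toy setting, phones that occur only alongside one concept (such as "d" with "dog") end up aligned to it, which is the inference the abstract describes. A phone shared by both words ("ao" here) stays ambiguous under this simple model.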

Original language: English (US)
Article number: 9097433
Pages (from-to): 1560-1573
Number of pages: 14
Journal: IEEE/ACM Transactions on Audio, Speech, and Language Processing
State: Published - 2020


Keywords

  • Unsupervised word discovery
  • language acquisition
  • machine translation
  • multimodal learning

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Acoustics and Ultrasonics
  • Computational Mathematics
  • Electrical and Electronic Engineering

