Multimodal word discovery and retrieval with phone sequence and image concepts

Research output: Contribution to journal › Conference article › peer-review


This paper demonstrates three systems capable of performing the multimodal word discovery task. A multimodal word discovery system accepts, as input, a database of spoken descriptions of images (or a set of corresponding phone transcripts), and learns a lexicon, that is, a mapping from phone strings to their associated image concepts. Of the three systems, one is based on a statistical machine translation (SMT) model and two are based on neural machine translation (NMT). On Flickr8k, the SMT-based model performs much better than the NMT-based ones, achieving a 49.6% F1 score. Finally, we apply our word discovery system to the task of image retrieval and achieve 29.1% recall@10 on the standard 1000-image Flickr8k test set.
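To make the two reported metrics concrete, the following is a minimal sketch of how an F1 score over a discovered lexicon and a recall@k retrieval score could be computed. It is not the paper's evaluation code: the abstract does not specify how F1 is scored, so treating the lexicon as a set of (phone string, image concept) pairs is an assumption, and the function names `pair_f1` and `recall_at_k` and the toy data are hypothetical.

```python
def pair_f1(predicted, gold):
    """F1 over sets of (phone_string, concept) pairs.

    Assumption: a predicted lexicon entry counts as correct only if the
    exact pair also appears in the gold lexicon.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def recall_at_k(rankings, gold_ids, k=10):
    """Fraction of queries whose gold image is among the top-k results.

    rankings: one list of image ids per query, best match first.
    gold_ids: the correct image id for each query, in the same order.
    """
    if len(rankings) != len(gold_ids):
        raise ValueError("rankings and gold_ids must have the same length")
    if not gold_ids:
        return 0.0
    hits = sum(1 for ranking, gold in zip(rankings, gold_ids)
               if gold in ranking[:k])
    return hits / len(gold_ids)


# Toy data, invented purely for illustration.
predicted_lexicon = {("k ae t", "cat"), ("d ao g", "dog")}
gold_lexicon = {("k ae t", "cat"), ("b er d", "bird")}
lexicon_f1 = pair_f1(predicted_lexicon, gold_lexicon)

toy_rankings = [["img2", "img1"], ["img3", "img4"]]
toy_gold = ["img1", "img9"]
toy_recall = recall_at_k(toy_rankings, toy_gold, k=2)
```

Under this reading, a recall@10 of 29.1% on a 1000-image test set would mean that for roughly 29 of every 100 queries the correct image appears among the ten highest-ranked candidates.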


  • Multimodal learning
  • Neural machine translation
  • Statistical machine translation
  • Unsupervised spoken word segmentation

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

