Multimodal word discovery and retrieval with phone sequence and image concepts

Research output: Contribution to journal › Conference article

Abstract

This paper demonstrates three different systems capable of performing the multimodal word discovery task. A multimodal word discovery system accepts, as input, a database of spoken descriptions of images (or a set of corresponding phone transcripts), and learns a lexicon, which is a mapping from phone strings to their associated image concepts. Three systems are demonstrated: one based on a statistical machine translation (SMT) model, and two based on neural machine translation (NMT). On Flickr8k, the SMT-based model performs much better than the NMT-based one, achieving a 49.6% F1 score. Finally, we apply our word discovery system to the task of image retrieval and achieve 29.1% recall@10 on the standard 1000-image Flickr8k test set.
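The SMT-based approach described in the abstract can be illustrated with a toy sketch: treat each utterance's phone tokens as source words and its image concepts as target words, then learn translation probabilities t(concept | phone) with an IBM Model 1-style EM loop, the classic SMT alignment technique. This is not the authors' code; the phone/concept data below is invented for illustration, and the paper's actual SMT model may differ in its details.

```python
# Toy IBM Model 1-style EM for phone-to-concept alignment (illustrative only).
from collections import defaultdict

# Invented corpus: (phone transcript, image concepts) pairs.
corpus = [
    (["d", "ao", "g"], ["dog"]),
    (["d", "ao", "g", "r", "ah", "n", "z"], ["dog", "run"]),
    (["k", "ae", "t"], ["cat"]),
    (["k", "ae", "t", "r", "ah", "n", "z"], ["cat", "run"]),
]

concepts = {c for _, cs in corpus for c in cs}
# Uniform initialisation of t(concept | phone).
t = {p: {c: 1.0 / len(concepts) for c in concepts}
     for phones, _ in corpus for p in phones}

for _ in range(20):  # EM iterations
    count = defaultdict(lambda: defaultdict(float))
    total = defaultdict(float)
    # E-step: collect expected alignment counts.
    for phones, cs in corpus:
        for p in phones:
            z = sum(t[p][c] for c in cs)  # normaliser for this phone
            for c in cs:
                frac = t[p][c] / z
                count[p][c] += frac
                total[p] += frac
    # M-step: renormalise counts into probabilities.
    for p in count:
        t[p] = {c: count[p][c] / total[p] for c in concepts}

# Phones that co-occur consistently with one concept align to it:
best = max(t["d"], key=t["d"].get)
print(best)  # -> dog
```

After a few iterations, the phone "d" (which only ever co-occurs with the concept "dog") receives nearly all its probability mass on that concept, while shared phones like "r" stay spread across "run"-related captions; thresholding such a table yields a phone-string-to-concept lexicon of the kind the paper evaluates.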

Original language: English (US)
Pages (from-to): 2683-2687
Number of pages: 5
Journal: Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
Volume: 2019-September
DOIs: 10.21437/Interspeech.2019-1487
State: Published - Jan 1 2019
Event: 20th Annual Conference of the International Speech Communication Association: Crossroads of Speech and Language, INTERSPEECH 2019 - Graz, Austria
Duration: Sep 15 2019 - Sep 19 2019

Keywords

  • Multimodal learning
  • Neural machine translation
  • Statistical machine translation
  • Unsupervised spoken word segmentation

ASJC Scopus subject areas

  • Language and Linguistics
  • Human-Computer Interaction
  • Signal Processing
  • Software
  • Modeling and Simulation

Cite this

@article{d9d288bdab64443d97e51e0cca1320ad,
title = "Multimodal word discovery and retrieval with phone sequence and image concepts",
abstract = "This paper demonstrates three different systems capable of performing the multimodal word discovery task. A multimodal word discovery system accepts, as input, a database of spoken descriptions of images (or a set of corresponding phone transcripts), and learns a lexicon, which is a mapping from phone strings to their associated image concepts. Three systems are demonstrated: one based on a statistical machine translation (SMT) model, and two based on neural machine translation (NMT). On Flickr8k, the SMT-based model performs much better than the NMT-based one, achieving a 49.6{\%} F1 score. Finally, we apply our word discovery system to the task of image retrieval and achieve 29.1{\%} recall@10 on the standard 1000-image Flickr8k test set.",
keywords = "Multimodal learning, Neural machine translation, Statistical machine translation, Unsupervised spoken word segmentation",
author = "Liming Wang and Mark Hasegawa-Johnson",
year = "2019",
month = "1",
day = "1",
doi = "10.21437/Interspeech.2019-1487",
language = "English (US)",
volume = "2019-September",
pages = "2683--2687",
journal = "Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH",
issn = "2308-457X",
}

TY - JOUR

T1 - Multimodal word discovery and retrieval with phone sequence and image concepts

AU - Wang, Liming

AU - Hasegawa-Johnson, Mark

PY - 2019/1/1

Y1 - 2019/1/1

N2 - This paper demonstrates three different systems capable of performing the multimodal word discovery task. A multimodal word discovery system accepts, as input, a database of spoken descriptions of images (or a set of corresponding phone transcripts), and learns a lexicon, which is a mapping from phone strings to their associated image concepts. Three systems are demonstrated: one based on a statistical machine translation (SMT) model, and two based on neural machine translation (NMT). On Flickr8k, the SMT-based model performs much better than the NMT-based one, achieving a 49.6% F1 score. Finally, we apply our word discovery system to the task of image retrieval and achieve 29.1% recall@10 on the standard 1000-image Flickr8k test set.

AB - This paper demonstrates three different systems capable of performing the multimodal word discovery task. A multimodal word discovery system accepts, as input, a database of spoken descriptions of images (or a set of corresponding phone transcripts), and learns a lexicon, which is a mapping from phone strings to their associated image concepts. Three systems are demonstrated: one based on a statistical machine translation (SMT) model, and two based on neural machine translation (NMT). On Flickr8k, the SMT-based model performs much better than the NMT-based one, achieving a 49.6% F1 score. Finally, we apply our word discovery system to the task of image retrieval and achieve 29.1% recall@10 on the standard 1000-image Flickr8k test set.

KW - Multimodal learning

KW - Neural machine translation

KW - Statistical machine translation

KW - Unsupervised spoken word segmentation

UR - http://www.scopus.com/inward/record.url?scp=85074728168&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85074728168&partnerID=8YFLogxK

U2 - 10.21437/Interspeech.2019-1487

DO - 10.21437/Interspeech.2019-1487

M3 - Conference article

AN - SCOPUS:85074728168

VL - 2019-September

SP - 2683

EP - 2687

JO - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

JF - Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH

SN - 2308-457X

ER -