Matching Words and Pictures

Kobus Barnard, Pinar Duygulu, David Alexander Forsyth, Nando De Freitas, David M. Blei, Michael I. Jordan

Research output: Contribution to journal › Article

Abstract

We present a new approach for modeling multi-modal data sets, focusing on the specific case of segmented images with associated text. Learning the joint distribution of image regions and words has many applications. We consider in detail predicting words associated with whole images (auto-annotation) and corresponding to particular image regions (region naming). Auto-annotation might help organize and access large collections of images. Region naming is a model of object recognition as a process of translating image regions to words, much as one might translate from one language to another. Learning the relationships between image regions and semantic correlates (words) is an interesting example of multi-modal data mining, particularly because it is typically hard to apply data mining techniques to collections of images. We develop a number of models for the joint distribution of image regions and words, including several that explicitly learn the correspondence between regions and words. We study multi-modal and correspondence extensions to Hofmann's hierarchical clustering/aspect model, a translation model adapted from statistical machine translation (Brown et al.), and a multi-modal extension to mixture of latent Dirichlet allocation (MoM-LDA). All models are assessed using a large collection of annotated images of real scenes. We study in depth the difficult problem of measuring performance. For the annotation task, we look at prediction performance on held-out data and present three alternative measures, oriented toward different types of task. Measuring the performance of correspondence methods is harder, because one must determine whether a word has been placed on the right region of an image. Annotation performance can serve as a proxy measure, but accurate measurement requires hand-labeled data and so must occur on a smaller scale. We show results using both the annotation proxy and manually labeled data.
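The abstract names the concrete machinery but, being an abstract, shows none of it. As a rough illustration of the translation-model variant (the Brown et al. adaptation mentioned above), here is a minimal EM sketch in Python. It is not the authors' code: it assumes each segmented region has already been vector-quantized into a discrete "blob" token (a simplification of the paper's feature pipeline), and the corpus, function names, and tokens below are invented for illustration.

```python
# Toy sketch of the translation-style correspondence model (Brown et al.
# adaptation), NOT the authors' implementation. Assumes image regions have
# been pre-quantized into discrete "blob" tokens; all data is invented.
from collections import defaultdict

def train_translation_table(corpus, n_iters=20):
    """EM for t[blob][word] = p(word | blob), IBM-Model-1 style.

    corpus: list of (blobs, words) pairs -- the discrete region tokens and
    annotation words for one image.
    """
    words_vocab = {w for _, words in corpus for w in words}
    blobs_vocab = {b for blobs, _ in corpus for b in blobs}

    # Start from a uniform translation table.
    t = {b: {w: 1.0 / len(words_vocab) for w in words_vocab}
         for b in blobs_vocab}

    for _ in range(n_iters):
        count = defaultdict(lambda: defaultdict(float))  # expected counts
        total = defaultdict(float)
        for blobs, words in corpus:
            for w in words:
                # Each annotation word is explained by a soft mixture over
                # the image's regions (the latent correspondence).
                z = sum(t[b][w] for b in blobs)
                for b in blobs:
                    frac = t[b][w] / z
                    count[b][w] += frac
                    total[b] += frac
        # M-step: renormalize expected counts into probabilities.
        t = {b: {w: c / total[b] for w, c in count[b].items()}
             for b in count}
    return t

def annotate(blobs, t, n_words=3):
    """Auto-annotation: pool t(word | blob) over an image's regions and
    return the top-scoring words."""
    scores = defaultdict(float)
    for b in blobs:
        for w, p in t[b].items():
            scores[w] += p
    return sorted(scores, key=scores.get, reverse=True)[:n_words]

# Invented mini-corpus: blob tokens are ints, annotations are word lists.
corpus = [
    ([1, 2], ["sky", "water"]),
    ([1, 3], ["sky", "grass"]),
    ([2, 3], ["water", "grass"]),
]
t = train_translation_table(corpus)
print(annotate([1, 2], t))        # auto-annotation for an image's regions
print(max(t[1], key=t[1].get))    # region naming: most likely word for blob 1
```

The two entry points mirror the two tasks in the abstract: `annotate` scores words for a whole image by pooling the translation table over its regions (auto-annotation), while taking the argmax of t(word | blob) for a single region is a crude form of region naming. The paper's actual models and its three held-out prediction measures are considerably richer than this sketch.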

Original language: English (US)
Pages (from-to): 1107-1135
Number of pages: 29
Journal: Journal of Machine Learning Research
Volume: 3
Issue number: 6
ISSN: 1532-4435
Publisher: Microtome Publishing
DOI: https://doi.org/10.1162/153244303322533214
State: Published - Aug 15 2003
Externally published: Yes

Fingerprint

  • Annotation
  • Data mining
  • Object recognition
  • Correspondence
  • Joint distribution
  • Semantics
  • Statistical machine translation
  • Model
  • Hierarchical clustering
  • Performance prediction
  • Correlate
  • Dirichlet
  • Alternatives
  • Modeling
  • Learning

ASJC Scopus subject areas

  • Software
  • Control and Systems Engineering
  • Statistics and Probability
  • Artificial Intelligence

Cite this

Barnard, K., Duygulu, P., Forsyth, D. A., De Freitas, N., Blei, D. M., & Jordan, M. I. (2003). Matching Words and Pictures. Journal of Machine Learning Research, 3(6), 1107-1135. https://doi.org/10.1162/153244303322533214
