Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary

P. Duygulu, K. Barnard, J. F.G. de Freitas, D. A. Forsyth

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

We describe a model of object recognition as machine translation. In this model, recognition is a process of annotating image regions with words. First, images are segmented into regions, which are classified into region types using a variety of features. A mapping between region types and the keywords supplied with the images is then learned using a method based on EM. This process is analogous to learning a lexicon from an aligned bitext. For the implementation we describe, these words are nouns taken from a large vocabulary. On a large test set, the method can predict numerous words with high accuracy. Simple methods identify words that cannot be predicted well. We show how to cluster words that are individually difficult to predict into clusters that can be predicted well; for example, we cannot predict the distinction between train and locomotive using the current set of features, but we can predict the underlying concept. The method is trained on a substantial collection of images. Extensive experimental results illustrate the strengths and weaknesses of the approach.
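The lexicon-learning step in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes an EM procedure in the style of IBM Model 1 from statistical machine translation, where each image is a pair of region-type tokens and keywords, and the table t[region][word] estimates p(word | region type). The function name learn_lexicon, the region-type names (b_sky and so on), and the toy corpus are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def learn_lexicon(corpus, iterations=20):
    """Estimate p(word | region type) from (region_types, words) pairs.

    Assumption: an IBM Model 1 style EM, used here as a stand-in for
    the EM-based method mentioned in the abstract.
    """
    regions = sorted({r for rs, _ in corpus for r in rs})
    words = sorted({w for _, ws in corpus for w in ws})

    # Start from a uniform translation table.
    t = {r: {w: 1.0 / len(words) for w in words} for r in regions}

    for _ in range(iterations):
        count = defaultdict(float)  # expected (region, word) pair counts
        total = defaultdict(float)  # expected counts per region type

        # E-step: share each keyword among the regions of its own image,
        # in proportion to the current table entries.
        for rs, ws in corpus:
            for w in ws:
                norm = sum(t[r][w] for r in rs)
                if norm == 0.0:
                    continue
                for r in rs:
                    frac = t[r][w] / norm
                    count[(r, w)] += frac
                    total[r] += frac

        # M-step: renormalise the expected counts per region type.
        for r in regions:
            if total[r] == 0.0:
                continue
            for w in words:
                t[r][w] = count[(r, w)] / total[r]

    return t


# Hypothetical toy corpus: three "images", each with two region types
# and the two keywords supplied with that image.
corpus = [
    (["b_sky", "b_grass"], ["sky", "grass"]),
    (["b_sky", "b_water"], ["sky", "water"]),
    (["b_grass", "b_tiger"], ["grass", "tiger"]),
]

lexicon = learn_lexicon(corpus)
for region, row in lexicon.items():
    best = max(row, key=row.get)
    print(region, "->", best, round(row[best], 3))
```

No image in the toy corpus says which keyword belongs to which region, yet the co-occurrence pattern across images is enough for EM to pull each region type toward one word. This is the same ambiguity that arises when learning a lexicon from sentence-aligned text without word alignments.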

Original language: English (US)
Title of host publication: Computer Vision - ECCV 2002 - 7th European Conference on Computer Vision, Proceedings
Editors: Mads Nielsen, Anders Heyden, Gunnar Sparr, Peter Johansen
Publisher: Springer-Verlag
Pages: 97-112
Number of pages: 16
ISBN (Electronic): 9783540437482
State: Published - Jan 1 2002
Externally published: Yes
Event: 7th European Conference on Computer Vision, ECCV 2002 - Copenhagen, Denmark
Duration: May 28 2002 to May 31 2002

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 2353
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Keywords

  • Correspondence
  • EM algorithm
  • Object recognition

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Duygulu, P., Barnard, K., de Freitas, J. F. G., & Forsyth, D. A. (2002). Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In M. Nielsen, A. Heyden, G. Sparr, & P. Johansen (Eds.), Computer Vision - ECCV 2002 - 7th European Conference on Computer Vision, Proceedings (pp. 97-112). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 2353). Springer-Verlag.

@inproceedings{c999854ea9d24c47987cba79c0c0586a,
title = "Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary",
keywords = "Correspondence, EM algorithm, Object recognition",
author = "Duygulu, P. and Barnard, K. and {de Freitas}, {J. F. G.} and Forsyth, {D. A.}",
year = "2002",
month = "1",
day = "1",
language = "English (US)",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
volume = "2353",
publisher = "Springer-Verlag",
pages = "97--112",
editor = "Mads Nielsen and Anders Heyden and Gunnar Sparr and Peter Johansen",
booktitle = "Computer Vision - ECCV 2002 - 7th European Conference on Computer Vision, Proceedings",

}

TY  - GEN
T1  - Object recognition as machine translation
T2  - Learning a lexicon for a fixed image vocabulary
AU  - Duygulu, P.
AU  - Barnard, K.
AU  - de Freitas, J. F.G.
AU  - Forsyth, D. A.
PY  - 2002/1/1
Y1  - 2002/1/1
AB  - We describe a model of object recognition as machine translation. In this model, recognition is a process of annotating image regions with words. First, images are segmented into regions, which are classified into region types using a variety of features. A mapping between region types and the keywords supplied with the images is then learned using a method based on EM. This process is analogous to learning a lexicon from an aligned bitext. For the implementation we describe, these words are nouns taken from a large vocabulary. On a large test set, the method can predict numerous words with high accuracy. Simple methods identify words that cannot be predicted well. We show how to cluster words that are individually difficult to predict into clusters that can be predicted well; for example, we cannot predict the distinction between train and locomotive using the current set of features, but we can predict the underlying concept. The method is trained on a substantial collection of images. Extensive experimental results illustrate the strengths and weaknesses of the approach.
KW  - Correspondence
KW  - EM algorithm
KW  - Object recognition
UR  - http://www.scopus.com/inward/record.url?scp=84937572644&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84937572644&partnerID=8YFLogxK
M3  - Conference contribution
AN  - SCOPUS:84937572644
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 97
EP  - 112
BT  - Computer Vision - ECCV 2002 - 7th European Conference on Computer Vision, Proceedings
A2  - Nielsen, Mads
A2  - Heyden, Anders
A2  - Sparr, Gunnar
A2  - Johansen, Peter
PB  - Springer-Verlag
ER  -