Recognition as translating images into text

Kobus Barnard, Pinar Duygulu, David Alexander Forsyth

Research output: Contribution to journal › Conference article

Abstract

We present an overview of a new paradigm for tackling long-standing computer vision problems. Specifically, our approach is to build statistical models that translate from visual representations (images) to semantic ones (associated text). As providing optimal text for training is difficult at best, we propose working with whatever associated text is available in large quantities. Examples include large image collections with keywords, museum image collections with descriptive text, news photos, and images on the web. In this paper we discuss how the translation approach can give a handle on difficult questions such as: What counts as an object? Which objects are easy to recognize, and which are hard? Which objects are indistinguishable using our features? How can low-level vision processes, such as feature-based segmentation, be integrated with high-level processes, such as grouping? We also summarize some of the models proposed for translating from visual information to text, and some of the methods used to evaluate their performance.
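The translation idea the abstract describes can be illustrated with a toy sketch (not the authors' code): images are reduced to discrete region tokens ("blobs") paired with keywords, and an EM loop in the style of IBM Model 1 machine translation learns p(word | blob). The data and token names below are hypothetical.

```python
# Toy illustration of learning a region-to-word translation table with EM.
# All blob/word tokens here are invented for the example.
from collections import defaultdict

# Hypothetical training set: each image is (region tokens, associated keywords).
images = [
    (["sky_blob", "grass_blob"], ["sky", "grass"]),
    (["sky_blob", "water_blob"], ["sky", "water"]),
    (["grass_blob", "tiger_blob"], ["grass", "tiger"]),
]

blobs = {b for bs, _ in images for b in bs}
words = {w for _, ws in images for w in ws}

# Uniform initialization of the translation table t[blob][word] = p(word | blob).
t = {b: {w: 1.0 / len(words) for w in words} for b in blobs}

for _ in range(20):  # EM iterations
    counts = defaultdict(lambda: defaultdict(float))
    for bs, ws in images:
        for w in ws:
            # E-step: softly align word w to the blobs present in this image.
            norm = sum(t[b][w] for b in bs)
            for b in bs:
                counts[b][w] += t[b][w] / norm
    # M-step: renormalize expected counts into probabilities.
    for b in blobs:
        total = sum(counts[b].values())
        t[b] = {w: counts[b][w] / total for w in words}

# Co-occurrence disambiguates: "tiger_blob" ends up most strongly linked to "tiger".
best = max(t["tiger_blob"], key=t["tiger_blob"].get)
print(best)  # → tiger
```

Even though "tiger_blob" only ever co-occurs with both "grass" and "tiger", the model resolves the ambiguity because "grass" is explained by "grass_blob" in other images, which is the core of the correspondence problem the paper discusses.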

Original language: English (US)
Pages (from-to): 168-178
Number of pages: 11
Journal: Proceedings of SPIE - The International Society for Optical Engineering
Volume: 5018
DOI: 10.1117/12.478427
State: Published - May 26, 2003
Externally published: Yes
Event: Internet Imaging IV - Santa Clara, CA, United States
Duration: Jan 21, 2003 - Jan 22, 2003

Keywords

  • Aspect model
  • Hierarchical clustering
  • Learning image semantics
  • Machine translation
  • Object recognition

ASJC Scopus subject areas

  • Electronic, Optical and Magnetic Materials
  • Condensed Matter Physics
  • Computer Science Applications
  • Applied Mathematics
  • Electrical and Electronic Engineering

Cite this

Recognition as translating images into text. / Barnard, Kobus; Duygulu, Pinar; Forsyth, David Alexander.

In: Proceedings of SPIE - The International Society for Optical Engineering, Vol. 5018, 26.05.2003, p. 168-178.

Research output: Contribution to journal › Conference article

@article{bc24e1ba2f844b47be5def0a51c065a7,
title = "Recognition as translating images into text",
abstract = "We present an overview of a new paradigm for tackling long-standing computer vision problems. Specifically, our approach is to build statistical models that translate from visual representations (images) to semantic ones (associated text). As providing optimal text for training is difficult at best, we propose working with whatever associated text is available in large quantities. Examples include large image collections with keywords, museum image collections with descriptive text, news photos, and images on the web. In this paper we discuss how the translation approach can give a handle on difficult questions such as: What counts as an object? Which objects are easy to recognize, and which are hard? Which objects are indistinguishable using our features? How can low-level vision processes, such as feature-based segmentation, be integrated with high-level processes, such as grouping? We also summarize some of the models proposed for translating from visual information to text, and some of the methods used to evaluate their performance.",
keywords = "Aspect model, Hierarchical clustering, Learning image semantics, Machine translation, Object recognition",
author = "Kobus Barnard and Pinar Duygulu and Forsyth, {David Alexander}",
year = "2003",
month = "5",
day = "26",
doi = "10.1117/12.478427",
language = "English (US)",
volume = "5018",
pages = "168--178",
journal = "Proceedings of SPIE - The International Society for Optical Engineering",
issn = "0277-786X",
publisher = "SPIE",
}

TY - JOUR

T1 - Recognition as translating images into text

AU - Barnard, Kobus

AU - Duygulu, Pinar

AU - Forsyth, David Alexander

PY - 2003/5/26

Y1 - 2003/5/26

N2 - We present an overview of a new paradigm for tackling long-standing computer vision problems. Specifically, our approach is to build statistical models that translate from visual representations (images) to semantic ones (associated text). As providing optimal text for training is difficult at best, we propose working with whatever associated text is available in large quantities. Examples include large image collections with keywords, museum image collections with descriptive text, news photos, and images on the web. In this paper we discuss how the translation approach can give a handle on difficult questions such as: What counts as an object? Which objects are easy to recognize, and which are hard? Which objects are indistinguishable using our features? How can low-level vision processes, such as feature-based segmentation, be integrated with high-level processes, such as grouping? We also summarize some of the models proposed for translating from visual information to text, and some of the methods used to evaluate their performance.

AB - We present an overview of a new paradigm for tackling long-standing computer vision problems. Specifically, our approach is to build statistical models that translate from visual representations (images) to semantic ones (associated text). As providing optimal text for training is difficult at best, we propose working with whatever associated text is available in large quantities. Examples include large image collections with keywords, museum image collections with descriptive text, news photos, and images on the web. In this paper we discuss how the translation approach can give a handle on difficult questions such as: What counts as an object? Which objects are easy to recognize, and which are hard? Which objects are indistinguishable using our features? How can low-level vision processes, such as feature-based segmentation, be integrated with high-level processes, such as grouping? We also summarize some of the models proposed for translating from visual information to text, and some of the methods used to evaluate their performance.

KW - Aspect model

KW - Hierarchical clustering

KW - Learning image semantics

KW - Machine translation

KW - Object recognition

UR - http://www.scopus.com/inward/record.url?scp=0038057932&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0038057932&partnerID=8YFLogxK

U2 - 10.1117/12.478427

DO - 10.1117/12.478427

M3 - Conference article

AN - SCOPUS:0038057932

VL - 5018

SP - 168

EP - 178

JO - Proceedings of SPIE - The International Society for Optical Engineering

JF - Proceedings of SPIE - The International Society for Optical Engineering

SN - 0277-786X

ER -