Abstract
Collections of digital pictures are now very common. Collections can range from a small set of family pictures to the entire contents of a picture site like Flickr. Such collections differ from what one might see if one simply attached a camera to a robot and recorded everything, because the pictures have been selected by people. They are not necessarily “good” pictures (say, by standards of photographic aesthetics), but, because they have been chosen, they display quite strong trends. It is common for such pictures to have associated text, which might be keywords or tags but is often in the form of sentences or brief paragraphs. Text could be a caption (a set of remarks explicitly bound to the picture, and often typeset in a way that emphasizes this), region labels (terms associated with image regions, perhaps identifying what is in that region), annotations (terms associated with the whole picture, often identifying objects in the picture), or just nearby text. We review a series of ideas about how to exploit associated text to help interpret pictures.

Word Frequencies, Objects, and Scenes
Most pictures in electronic form seem to have related words nearby (or sound or metadata, and so on; we focus on words), so it is easy to collect word and picture datasets, and there are many examples. Such multimodal collections should probably be seen as the usual case, because one usually has to deliberately ignore information to collect only images.
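To make the kinds of associated text concrete, here is a minimal sketch, not taken from the chapter, of one way to represent a word-and-picture record (caption, whole-picture annotations, region labels) and to compute word frequencies over a toy collection. All names, the record schema, and the toy data are hypothetical illustrations.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class PictureRecord:
    """One item in a word-and-picture dataset (hypothetical schema)."""
    image_path: str
    caption: str = ""  # remarks explicitly bound to the picture
    annotations: list[str] = field(default_factory=list)  # whole-picture terms
    region_labels: dict[str, str] = field(default_factory=dict)  # region id -> term

def word_frequencies(records: list[PictureRecord]) -> Counter:
    """Count how often each word appears across all text associated with the pictures."""
    counts: Counter = Counter()
    for rec in records:
        counts.update(rec.caption.lower().split())
        counts.update(a.lower() for a in rec.annotations)
        counts.update(t.lower() for t in rec.region_labels.values())
    return counts

# Toy collection: even a tiny, human-selected set shows repeated terms.
dataset = [
    PictureRecord("beach.jpg", caption="family at the beach",
                  annotations=["people", "sea"],
                  region_labels={"r1": "sand", "r2": "water"}),
    PictureRecord("party.jpg", caption="birthday party",
                  annotations=["people", "cake"]),
]
print(word_frequencies(dataset).most_common(3))
```

Because the pictures are chosen by people, such counts are typically heavy-tailed: a few terms (here, "people") dominate, which is one of the strong trends the chapter refers to.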
| Field | Value |
|---|---|
| Original language | English (US) |
| Title of host publication | Object Categorization |
| Subtitle of host publication | Computer and Human Vision Perspectives |
| Publisher | Cambridge University Press |
| Pages | 167-181 |
| Number of pages | 15 |
| Volume | 9780521887380 |
| ISBN (Electronic) | 9780511635465 |
| ISBN (Print) | 9780521887380 |
| DOIs | |
| State | Published - Jan 1 2009 |
ASJC Scopus subject areas
- Computer Science (all)