Automatic video annotation by mining speech transcripts

Atulya Velivelli, Thomas S. Huang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We describe a model for automatically predicting text annotations for video data. The speech transcripts of the videos are clustered using an aspect model, and keywords are extracted based on the aspect distribution; in this way we capture the semantic information available in the video data. This technique for automatic keyword-vocabulary construction makes labeling video data a very easy task. We then build a video shot vocabulary by utilizing both static images and motion cues. Using a maximum entropy criterion, we learn a conditional exponential model by defining constraint features over combinations of the shot vocabulary and the keyword vocabulary. Our method then uses a maximum a posteriori estimate under the exponential model to predict the annotations. We evaluate the model's ability to predict annotations in terms of mean negative log-likelihood and retrieval performance on the test set. A comparison of the exponential model with baseline methods indicates that the results are encouraging.
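The aspect model used for transcript clustering is commonly realized as probabilistic latent semantic analysis (PLSA), trained with EM on a term-document count matrix; top-probability words under each aspect can then serve as candidate keywords. The following is a minimal sketch of that idea, not the paper's implementation: the function names, the toy vocabulary, and the choice of plain EM over a term-document matrix are all illustrative assumptions.

```python
import numpy as np

def train_aspect_model(counts, n_aspects, n_iter=100, seed=0):
    """EM for a PLSA-style aspect model on a (docs x words) count matrix.

    Returns p_w_z (aspects x words) and p_z_d (docs x aspects).
    Illustrative sketch; not the paper's exact formulation.
    """
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random normalized initialization of P(w|z) and P(z|d).
    p_w_z = rng.random((n_aspects, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_aspects))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # E-step: posterior P(z|d,w), shape (docs, words, aspects).
        post = p_z_d[:, None, :] * p_w_z.T[None, :, :]
        post /= post.sum(axis=2, keepdims=True) + 1e-12
        # M-step: re-estimate parameters from expected counts.
        expected = counts[:, :, None] * post           # (docs, words, aspects)
        p_w_z = expected.sum(axis=0).T                 # (aspects, words)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=1)                   # (docs, aspects)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d

def top_keywords(p_w_z, vocab, k=3):
    """Top-k words per aspect by P(w|z): a candidate keyword vocabulary."""
    return [[vocab[i] for i in np.argsort(-row)[:k]] for row in p_w_z]
```

For example, on a toy 4-document, 4-word count matrix with two clear topics, `train_aspect_model(counts, n_aspects=2)` yields per-aspect word distributions from which `top_keywords` extracts the dominant terms per aspect.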

Original language: English (US)
Title of host publication: 2006 Conference on Computer Vision and Pattern Recognition Workshop
DOIs
State: Published - 2006
Externally published: Yes
Event: 2006 Conference on Computer Vision and Pattern Recognition Workshops - New York, NY, United States
Duration: Jun 17, 2006 – Jun 22, 2006

Publication series

Name: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Volume: 2006
ISSN (Print): 1063-6919

Other

Other: 2006 Conference on Computer Vision and Pattern Recognition Workshops
Country/Territory: United States
City: New York, NY
Period: 6/17/06 – 6/22/06

ASJC Scopus subject areas

  • Software
  • Computer Vision and Pattern Recognition
