Enhancing video summarization via vision-language embedding

Bryan A. Plummer, Matthew Brown, Svetlana Lazebnik

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper addresses video summarization, or the problem of distilling a raw video into a shorter form while still capturing the original story. We show that visual representations supervised by freeform language make a good fit for this application by extending a recent submodular summarization approach [9] with representativeness and interestingness objectives computed on features from a joint vision-language embedding space. We perform an evaluation on two diverse datasets, UT Egocentric [18] and TV Episodes [45], and show that our new objectives give improved summarization ability compared to standard visual features alone. Our experiments also show that the vision-language embedding need not be trained on domain-specific data, but can be learned from standard still image vision-language datasets and transferred to video. A further benefit of our model is the ability to guide a summary using freeform text input at test time, allowing user customization.
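The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a facility-location term as the representativeness objective (a common submodular choice), treats interestingness as cosine similarity to an embedded text query, and uses random vectors in place of real joint vision-language embedding features. The names `greedy_summary`, `w_rep`, and `w_int` are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale vectors to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def greedy_summary(segments, query, budget, w_rep=1.0, w_int=1.0):
    """Greedily pick up to `budget` segment indices.

    segments: (n, d) array of per-segment embedding features (assumed given).
    query:    (d,) embedding of a freeform text query (assumed given).
    The objective is an illustrative weighted sum of a facility-location
    representativeness term and a per-segment interestingness score.
    """
    segments = l2_normalize(np.asarray(segments, dtype=float))
    query = l2_normalize(np.asarray(query, dtype=float))
    n = len(segments)

    # Nonnegative cosine similarities keep the coverage term monotone.
    sim = np.clip(segments @ segments.T, 0.0, None)
    interest = np.clip(segments @ query, 0.0, None)

    chosen = np.zeros(n, dtype=bool)
    coverage = np.zeros(n)  # best similarity of each segment to the summary so far
    selected = []
    for _ in range(min(budget, n)):
        # Marginal gain in total coverage if each candidate were added.
        rep_gain = np.maximum(sim, coverage[None, :]).sum(axis=1) - coverage.sum()
        gain = w_rep * rep_gain / n + w_int * interest
        gain[chosen] = -np.inf  # never pick the same segment twice
        best = int(np.argmax(gain))
        chosen[best] = True
        selected.append(best)
        coverage = np.maximum(coverage, sim[best])
    return sorted(selected)  # return indices in temporal order


# Stand-in data: 20 segments with 8-dimensional features and one query vector.
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 8))
text_query = rng.normal(size=8)
print(greedy_summary(features, text_query, budget=5))
```

Raising `w_int` relative to `w_rep` in this sketch biases the summary toward segments that match the text query, which loosely mirrors the text-guided customization the abstract mentions.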

Original language: English (US)
Title of host publication: Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 1052-1060
Number of pages: 9
ISBN (Electronic): 9781538604571
State: Published - Nov 6 2017
Event: 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017 - Honolulu, United States
Duration: Jul 21 2017 – Jul 26 2017

Publication series

Name: Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017
Volume: 2017-January

Other

Other: 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017
Country: United States
City: Honolulu
Period: 7/21/17 – 7/26/17

ASJC Scopus subject areas

  • Signal Processing
  • Computer Vision and Pattern Recognition
