TY - GEN
T1 - Weakly supervised learning of object segmentations from web-scale video
AU - Hartmann, Glenn
AU - Grundmann, Matthias
AU - Hoffman, Judy
AU - Tsai, David
AU - Kwatra, Vivek
AU - Madani, Omid
AU - Vijayanarasimhan, Sudheendra
AU - Essa, Irfan
AU - Rehg, James
AU - Sukthankar, Rahul
PY - 2012
Y1 - 2012
N2 - We propose to learn pixel-level segmentations of objects from weakly labeled (tagged) internet videos. Specifically, given a large collection of raw YouTube content, along with potentially noisy tags, our goal is to automatically generate spatiotemporal masks for each object, such as "dog", without employing any pre-trained object detectors. We formulate this problem as learning weakly supervised classifiers for a set of independent spatiotemporal segments. The object seeds obtained using segment-level classifiers are further refined using graph cuts to generate high-precision object masks. Our results, obtained by training on a dataset of 20,000 YouTube videos weakly tagged into 15 classes, demonstrate automatic extraction of pixel-level object masks. Evaluated against a ground-truthed subset of 50,000 frames with pixel-level annotations, we confirm that our proposed methods can learn good object masks just by watching YouTube.
AB - We propose to learn pixel-level segmentations of objects from weakly labeled (tagged) internet videos. Specifically, given a large collection of raw YouTube content, along with potentially noisy tags, our goal is to automatically generate spatiotemporal masks for each object, such as "dog", without employing any pre-trained object detectors. We formulate this problem as learning weakly supervised classifiers for a set of independent spatiotemporal segments. The object seeds obtained using segment-level classifiers are further refined using graph cuts to generate high-precision object masks. Our results, obtained by training on a dataset of 20,000 YouTube videos weakly tagged into 15 classes, demonstrate automatic extraction of pixel-level object masks. Evaluated against a ground-truthed subset of 50,000 frames with pixel-level annotations, we confirm that our proposed methods can learn good object masks just by watching YouTube.
UR - https://www.scopus.com/pages/publications/84867694357
U2 - 10.1007/978-3-642-33863-2_20
DO - 10.1007/978-3-642-33863-2_20
M3 - Conference contribution
AN - SCOPUS:84867694357
SN - 9783642338625
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 198
EP - 208
BT - Computer Vision, ECCV 2012 - Workshops and Demonstrations, Proceedings
PB - Springer
T2 - Computer Vision, ECCV 2012 - Workshops and Demonstrations, Proceedings
Y2 - 7 October 2012 through 13 October 2012
ER -