TY - GEN
T1 - Object Discovery from Motion-Guided Tokens
AU - Bao, Zhipeng
AU - Tokmakov, Pavel
AU - Wang, Yu-Xiong
AU - Gaidon, Adrien
AU - Hebert, Martial
N1 - Acknowledgements. We thank Dian Chen, Alexei Efros, and Andrew Owens for their valuable comments. This research was supported by Toyota Research Institute. YXW was supported in part by NSF Grant 2106825, NIFA Award 2020-67021-32799, and the NCSA Fellows program.
PY - 2023
Y1 - 2023
N2 - Object discovery - separating objects from the background without manual labels - is a fundamental open challenge in computer vision. Previous methods struggle to go beyond clustering of low-level cues, whether handcrafted (e.g., color, texture) or learned (e.g., from auto-encoders). In this work, we augment the auto-encoder representation learning framework with two key components: motion-guidance and mid-level feature tokenization. Although both have been separately investigated, we introduce a new transformer decoder showing that their benefits can compound thanks to motion-guided vector quantization. We show that our architecture effectively leverages the synergy between motion and tokenization, improving upon the state of the art on both synthetic and real datasets. Our approach enables the emergence of interpretable object-specific mid-level features, demonstrating the benefits of motion-guidance (no labeling) and quantization (interpretability, memory efficiency).
KW - Segmentation
KW - grouping and shape analysis
UR - http://www.scopus.com/inward/record.url?scp=85165317748&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85165317748&partnerID=8YFLogxK
U2 - 10.1109/CVPR52729.2023.02200
DO - 10.1109/CVPR52729.2023.02200
M3 - Conference contribution
AN - SCOPUS:85165317748
T3 - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
SP - 22972
EP - 22981
BT - Proceedings - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
PB - IEEE Computer Society
T2 - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
Y2 - 18 June 2023 through 22 June 2023
ER -