TY - GEN
T1 - Learning Dense Object Descriptors from Multiple Views for Low-shot Category Generalization
AU - Stojanov, Stefan
AU - Thai, Anh
AU - Huang, Zixuan
AU - Rehg, James M.
N1 - Funding Information:
This work was supported in part by NIH R01HD104624-01A1, NIH R01MH114999, NSF OIA2033413, and a gift from Facebook. We thank Miao Liu and Wenqi Jia for their helpful feedback and discussion.
Publisher Copyright:
© 2022 Neural Information Processing Systems Foundation. All rights reserved.
PY - 2022
Y1 - 2022
N2 - A hallmark of the deep learning era for computer vision is the successful use of large-scale labeled datasets to train feature representations. This has been done for tasks ranging from object recognition and semantic segmentation to optical flow estimation and novel view synthesis of 3D scenes. In this work, we aim to learn dense discriminative object representations for low-shot category recognition without requiring any category labels. To this end, we propose Deep Object Patch Encodings (DOPE), which can be trained from multiple views of object instances without any category or semantic object part labels. To train DOPE, we assume access to sparse depths, foreground masks and known cameras to obtain pixel-level correspondences between views of an object, and use these correspondences to formulate a self-supervised learning task to learn discriminative object patches. We find that DOPE can be used directly for low-shot classification of novel categories using local-part matching, and that it is competitive with or outperforms supervised and self-supervised learning baselines. Code and data are available at https://github.com/rehg-lab/dope_selfsup.
UR - http://www.scopus.com/inward/record.url?scp=85163209189&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85163209189&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85163209189
T3 - Advances in Neural Information Processing Systems
BT - 36th Conference on Neural Information Processing Systems, NeurIPS 2022
A2 - Koyejo, S.
A2 - Mohamed, S.
A2 - Agarwal, A.
A2 - Belgrave, D.
A2 - Cho, K.
A2 - Oh, A.
PB - Curran Associates Inc.
T2 - 36th Conference on Neural Information Processing Systems, NeurIPS 2022
Y2 - 28 November 2022 through 9 December 2022
ER -