TY - JOUR
T1 - Region-Based Representations Revisited
AU - Shlapentokh-Rothman, Michal
AU - Blume, Ansel
AU - Xiao, Yao
AU - Wu, Yuqun
AU - Sethuraman, T. V.
AU - Tao, Heyi
AU - Lee, Jae Yong
AU - Torres, Wilfredo
AU - Wang, Yuxiong
AU - Hoiem, Derek
N1 - This work is supported in part by the following awards: ONR N00014-23-1-2383, ONR N00014-21-1-2705, DARPA HR0011-23-9-0060, NSF IIS 23-12102. The views and conclusions expressed are those of the authors, and not necessarily representative of the US Government or its agencies.
PY - 2024
Y1 - 2024
N2 - We investigate whether region-based representations are effective for recognition. Regions were once a mainstay in recognition approaches, but pixel- and patch-based features are now used almost exclusively. We show that recent class-agnostic segmenters like SAM can be effectively combined with strong self-supervised representations, like those from DINOv2, and used for a wide variety of tasks, including semantic segmentation, object-based image retrieval, and multi-image analysis. Once the masks and features are extracted, these representations, even with linear decoders, enable competitive performance, making them well suited to applications that require custom queries. The representations' compactness also makes them well suited to video analysis and other problems requiring inference across many images.
AB - We investigate whether region-based representations are effective for recognition. Regions were once a mainstay in recognition approaches, but pixel- and patch-based features are now used almost exclusively. We show that recent class-agnostic segmenters like SAM can be effectively combined with strong self-supervised representations, like those from DINOv2, and used for a wide variety of tasks, including semantic segmentation, object-based image retrieval, and multi-image analysis. Once the masks and features are extracted, these representations, even with linear decoders, enable competitive performance, making them well suited to applications that require custom queries. The representations' compactness also makes them well suited to video analysis and other problems requiring inference across many images.
UR - http://www.scopus.com/inward/record.url?scp=85209053518&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85209053518&partnerID=8YFLogxK
U2 - 10.1109/CVPR52733.2024.01619
DO - 10.1109/CVPR52733.2024.01619
M3 - Conference article
AN - SCOPUS:85209053518
SN - 1063-6919
SP - 17107
EP - 17116
JO - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
JF - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
T2 - 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024
Y2 - 16 June 2024 through 22 June 2024
ER -