TY - JOUR
T1 - AutoFocusFormer: Image Segmentation off the Grid
T2 - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023
AU - Chen, Ziwen
AU - Patnaik, Kaushik
AU - Zhai, Shuangfei
AU - Wan, Alvin
AU - Ren, Zhile
AU - Schwing, Alex
AU - Colburn, Alex
AU - Li, Fuxin
N1 - We thank Dr. Hanlin Goh and Dr. Tatiana Likhomanenko for valuable suggestions to improve the paper. Chen Ziwen is partially supported by NSF grant #1751402.
PY - 2023
Y1 - 2023
AB - Real-world images often have highly imbalanced content density. Some areas are very uniform, e.g., large patches of blue sky, while other areas are scattered with many small objects. Yet, the commonly used successive grid downsampling strategy in convolutional deep networks treats all areas equally. Hence, small objects are represented in very few spatial locations, leading to worse results in tasks such as segmentation. Intuitively, retaining more pixels representing small objects during downsampling helps to preserve important information. To achieve this, we propose AutoFocusFormer (AFF), a local-attention transformer image recognition backbone, which performs adaptive downsampling by learning to retain the most important pixels for the task. Since adaptive downsampling generates a set of pixels irregularly distributed on the image plane, we abandon the classic grid structure. Instead, we develop a novel point-based local attention block, facilitated by a balanced clustering module and a learnable neighborhood merging module, which yields representations for our point-based versions of state-of-the-art segmentation heads. Experiments show that our AutoFocusFormer (AFF) improves significantly over baseline models of similar sizes.
KW - Deep learning architectures and techniques
UR - http://www.scopus.com/inward/record.url?scp=85192505783&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85192505783&partnerID=8YFLogxK
DO - 10.1109/CVPR52729.2023.01748
M3 - Conference article
AN - SCOPUS:85192505783
SN - 1063-6919
VL - 2023-June
SP - 17227
EP - 17236
JO - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
JF - Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition
Y2 - 18 June 2023 through 22 June 2023
ER -