LEARNING TO DECOMPOSE VISUAL FEATURES WITH LATENT TEXTUAL PROMPTS

Feng Wang, Manling Li, Xudong Lin, Hairong Lv, Alexander G. Schwing, Heng Ji

Research output: Contribution to conference › Paper › peer-review

Abstract

Recent advances in pre-training vision-language models like CLIP (Radford et al., 2021) have shown great potential in learning transferable visual representations. Nonetheless, for downstream inference, CLIP-like models suffer from either 1) degraded accuracy and robustness when classifying by retrieving textual class names (the zero-shot protocol), or 2) a break in the well-established vision-language alignment (linear probing). To combine the best of both worlds, we propose Decomposed Feature Prompting (DeFo). DeFo maintains the dual-model architecture yet leverages learnable embeddings as textual input and performs classification with an additional linear layer. As a result, DeFo can extract decomposed visual features with the help of textual prompts and supports a scalable number of language inputs. Our empirical study shows that DeFo significantly improves vision-language models. For example, DeFo obtains 73.2% test accuracy on ImageNet with a ResNet-50 backbone without tuning any pretrained weights of either the vision or the language encoder, outperforming zero-shot CLIP by a large margin of 15.0% and state-of-the-art vision-language prompt tuning by 7.6%.
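The sketch below illustrates the architecture described in the abstract: learnable prompt embeddings are passed through a frozen text encoder, the image feature is compared against each resulting prompt feature to form a "decomposed" feature vector, and a linear layer maps that vector to class logits. This is a minimal interpretation, not the authors' released code; the module names (DeFoHead, DummyTextEncoder) and the default sizes (64 prompts, 16 tokens, 512-dimensional embeddings) are illustrative assumptions, and the stand-in encoder takes the place of CLIP's pretrained language tower.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DeFoHead(nn.Module):
    """Learnable textual prompts plus a linear classifier on top of frozen encoders (sketch)."""
    def __init__(self, text_encoder, num_prompts=64, prompt_len=16,
                 embed_dim=512, num_classes=1000):
        super().__init__()
        self.text_encoder = text_encoder  # frozen CLIP-like text tower; not updated here
        # Latent textual prompts: one learnable token sequence per decomposed
        # feature dimension. num_prompts is the "scalable size of language inputs".
        self.prompts = nn.Parameter(torch.randn(num_prompts, prompt_len, embed_dim) * 0.02)
        # Additional linear layer mapping the decomposed features to class logits.
        self.classifier = nn.Linear(num_prompts, num_classes)

    def forward(self, image_features):
        # Encode each learnable prompt sequence with the frozen text encoder.
        text_features = self.text_encoder(self.prompts)           # (P, D)
        image_features = F.normalize(image_features, dim=-1)      # (B, D)
        text_features = F.normalize(text_features, dim=-1)
        # Decomposed visual features: similarity of the image to each prompt.
        decomposed = image_features @ text_features.t()           # (B, P)
        return self.classifier(decomposed)                        # (B, C)

# Hypothetical stand-in for the frozen language encoder, used only to keep
# this example self-contained; DeFo itself would reuse CLIP's text tower.
class DummyTextEncoder(nn.Module):
    def forward(self, prompt_embeddings):
        # Pool over the token dimension to get one feature per prompt sequence.
        return prompt_embeddings.mean(dim=1)

if __name__ == "__main__":
    head = DeFoHead(DummyTextEncoder(), num_prompts=64, embed_dim=512, num_classes=10)
    fake_image_features = torch.randn(4, 512)  # e.g. from a frozen image encoder
    logits = head(fake_image_features)
    print(logits.shape)                        # torch.Size([4, 10])

In this reading, only the prompt embeddings and the final linear layer are trained, which matches the abstract's claim that no pretrained weights of the vision or language encoder are tuned.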

Original language: English (US)
State: Published - 2023
Event: 11th International Conference on Learning Representations, ICLR 2023 - Kigali, Rwanda
Duration: May 1, 2023 – May 5, 2023

Conference

Conference: 11th International Conference on Learning Representations, ICLR 2023
Country/Territory: Rwanda
City: Kigali
Period: 5/1/23 – 5/5/23

ASJC Scopus subject areas

  • Language and Linguistics
  • Computer Science Applications
  • Education
  • Linguistics and Language
