MACAROON: Training Vision-Language Models To Be Your Engaged Partners

  • Shujin Wu
  • May Fung
  • Sha Li
  • Yixin Wan
  • Kai-Wei Chang
  • Heng Ji
Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Large vision-language models (LVLMs), while proficient in following instructions and responding to diverse questions, invariably generate detailed responses even when questions are ambiguous or unanswerable, leading to hallucination and bias issues. It is therefore essential for LVLMs to proactively engage with humans, asking for clarification or additional information to produce better responses. In this study, we aim to shift LVLMs from passive answer providers to proactive engaged partners. We begin by establishing a three-tiered hierarchy of invalid, ambiguous, and personalizable questions to measure the proactive engagement capabilities of LVLMs. Using this hierarchy, we create PIE (ProactIve Engagement Evaluation) with GPT-4o and human annotators; it comprises 853 questions across six distinct, fine-grained question types, each verified by human annotators and accompanied by well-defined metrics. Our evaluations on PIE reveal poor performance among existing LVLMs, with the best-performing open-weights model achieving an Aggregate Align Rate (AAR) of only 0.28. In response, we introduce MACAROON (self-iMaginAtion for ContrAstive pReference OptimizatiON), which instructs LVLMs to autonomously generate contrastive response pairs for unlabeled questions given the task description and human-crafted criteria. The self-imagined data is then formatted for conditional reinforcement learning. Experimental results show that MACAROON effectively improves LVLMs' ability to engage proactively (0.84 AAR) while maintaining comparable performance on general tasks.
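The abstract's data-construction pipeline can be sketched as follows. This is a minimal, hypothetical illustration (function names, the `<good>`/`<bad>` control tokens, and the toy model call are assumptions, not the authors' actual code): for each unlabeled question, the model "imagines" a preferred response (e.g., asking for clarification) and a dispreferred one (a direct, overconfident answer), and the pair is flattened into conditional-RL examples by prepending a control token that marks response quality.

```python
# Hypothetical sketch of MACAROON-style self-imagined data construction.
# `imagine` stands in for an LVLM call that, given the question plus the
# task description and human-crafted criteria, drafts both response styles.

def build_contrastive_pair(question, imagine):
    """Generate a (preferred, dispreferred) response pair for a question."""
    preferred = imagine(question, style="proactive")  # asks for clarification
    dispreferred = imagine(question, style="direct")  # answers regardless
    return preferred, dispreferred

def to_conditional_examples(question, preferred, dispreferred):
    """Format the pair for conditional reinforcement learning: each response
    is prefixed with a control token encoding its quality, so the model
    learns to condition generation on the desired behavior."""
    return [
        {"prompt": f"<good> {question}", "response": preferred},
        {"prompt": f"<bad> {question}", "response": dispreferred},
    ]

# Toy stand-in for the LVLM call, for demonstration only.
def toy_imagine(question, style):
    if style == "proactive":
        return "Could you clarify which object in the image you mean?"
    return "It is definitely the red one."

q = "Which one is better?"
pref, disp = build_contrastive_pair(q, toy_imagine)
examples = to_conditional_examples(q, pref, disp)
```

In this reading, the contrastive pair never requires human labels: the model's own criteria-guided generations supply both sides, and the control-token formatting lets a single conditional-RL objective exploit both the good and bad examples.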

Original language: English (US)
Title of host publication: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024
Editors: Yaser Al-Onaizan, Mohit Bansal, Yun-Nung Chen
Publisher: Association for Computational Linguistics (ACL)
Pages: 7715-7731
Number of pages: 17
ISBN (Electronic): 9798891761681
DOIs
State: Published - 2024
Event: 2024 Findings of the Association for Computational Linguistics, EMNLP 2024 - Hybrid, Miami, United States
Duration: Nov 12, 2024 – Nov 16, 2024

Publication series

Name: EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024

Conference

Conference: 2024 Findings of the Association for Computational Linguistics, EMNLP 2024
Country/Territory: United States
City: Hybrid, Miami
Period: 11/12/24 – 11/16/24

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems
  • Linguistics and Language
