TY - GEN
T1 - MACAROON: Training Vision-Language Models to Be Your Engaged Partners
T2 - 2024 Findings of the Association for Computational Linguistics, EMNLP 2024
AU - Wu, Shujin
AU - Fung, May
AU - Li, Sha
AU - Wan, Yixin
AU - Chang, Kai-Wei
AU - Ji, Heng
N1 - This research is based upon work supported by the DARPA ITM Program No. FA8650-23-C-7316 and by the AI Research Institutes program of the National Science Foundation and the Institute of Education Sciences, U.S. Department of Education, through Award #2229873 - AI Institute for Transforming Education for Children with Speech and Language Processing Challenges. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
PY - 2024
Y1 - 2024
N2 - Large vision-language models (LVLMs), while proficient in following instructions and responding to diverse questions, invariably generate detailed responses even when questions are ambiguous or unanswerable, leading to hallucination and bias issues. Thus, it is essential for LVLMs to proactively engage with humans to ask for clarifications or additional information that yields better responses. In this study, we aim to shift LVLMs from passive answer providers to proactively engaged partners. We begin by establishing a three-tiered hierarchy of invalid, ambiguous, and personalizable questions to measure the proactive engagement capabilities of LVLMs. Utilizing this hierarchy, we create PIE (ProactIve Engagement Evaluation) through GPT-4o and human annotators, consisting of 853 questions across six distinct, fine-grained question types that are verified by human annotators and accompanied by well-defined metrics. Our evaluations on PIE indicate poor performance among existing LVLMs, with the best-performing open-weights model achieving an Aggregate Align Rate (AAR) of only 0.28. In response, we introduce MACAROON (self-iMaginAtion for ContrAstive pReference OptimizatiON), which instructs LVLMs to autonomously generate contrastive response pairs for unlabeled questions, given the task description and human-crafted criteria. The self-imagined data is then formatted for conditional reinforcement learning. Experimental results show that MACAROON effectively improves LVLMs' capability to engage proactively (0.84 AAR) while maintaining comparable performance on general tasks.
UR - https://www.scopus.com/pages/publications/85217622368
U2 - 10.18653/v1/2024.findings-emnlp.454
DO - 10.18653/v1/2024.findings-emnlp.454
M3 - Conference contribution
AN - SCOPUS:85217622368
T3 - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024
SP - 7715
EP - 7731
BT - EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024
A2 - Al-Onaizan, Yaser
A2 - Bansal, Mohit
A2 - Chen, Yun-Nung
PB - Association for Computational Linguistics (ACL)
Y2 - 12 November 2024 through 16 November 2024
ER -