TY - GEN
T1 - ViStruct: Visual Structural Knowledge Extraction via Curriculum Guided Code-Vision Representation
T2 - 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023
AU - Chen, Yangyi
AU - Wang, Xingyao
AU - Li, Manling
AU - Hoiem, Derek
AU - Ji, Heng
N1 - We thank the anonymous reviewers for their suggestions and comments. This research is based upon work supported by U.S. DARPA ECOLE Program No. HR00112390060 and U.S. DARPA KAIROS Program No. FA8750-19-2-1004. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA, or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
PY - 2023
Y1 - 2023
N2 - State-of-the-art vision-language models (VLMs) still have limited performance in structural knowledge extraction, such as relations between objects. In this work, we present ViStruct, a training framework to learn VLMs for effective visual structural knowledge extraction. Two novel designs are incorporated. First, we propose to leverage the inherent structure of programming language to depict visual structural information. This approach enables explicit and consistent representation of visual structural information of multiple granularities, such as concepts, relations, and events, in a well-organized structured format. Second, we introduce curriculum-based learning for VLMs to progressively comprehend visual structures, from fundamental visual concepts to intricate event structures. Our intuition is that lower-level knowledge may contribute to complex visual structure understanding. Furthermore, we compile and release a collection of datasets tailored for visual structural knowledge extraction. We adopt a weakly-supervised approach to directly generate visual event structures from captions for ViStruct training, capitalizing on abundant image-caption pairs from the web. In experiments, we evaluate ViStruct on visual structure prediction tasks, demonstrating its effectiveness in improving the understanding of visual structures. The code is publicly available at https://github.com/Yangyi-Chen/vi-struct.
AB - State-of-the-art vision-language models (VLMs) still have limited performance in structural knowledge extraction, such as relations between objects. In this work, we present ViStruct, a training framework to learn VLMs for effective visual structural knowledge extraction. Two novel designs are incorporated. First, we propose to leverage the inherent structure of programming language to depict visual structural information. This approach enables explicit and consistent representation of visual structural information of multiple granularities, such as concepts, relations, and events, in a well-organized structured format. Second, we introduce curriculum-based learning for VLMs to progressively comprehend visual structures, from fundamental visual concepts to intricate event structures. Our intuition is that lower-level knowledge may contribute to complex visual structure understanding. Furthermore, we compile and release a collection of datasets tailored for visual structural knowledge extraction. We adopt a weakly-supervised approach to directly generate visual event structures from captions for ViStruct training, capitalizing on abundant image-caption pairs from the web. In experiments, we evaluate ViStruct on visual structure prediction tasks, demonstrating its effectiveness in improving the understanding of visual structures. The code is publicly available at https://github.com/Yangyi-Chen/vi-struct.
UR - http://www.scopus.com/inward/record.url?scp=85184799635&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85184799635&partnerID=8YFLogxK
U2 - 10.18653/v1/2023.emnlp-main.824
DO - 10.18653/v1/2023.emnlp-main.824
M3 - Conference contribution
AN - SCOPUS:85184799635
T3 - EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings
SP - 13342
EP - 13357
BT - EMNLP 2023 - 2023 Conference on Empirical Methods in Natural Language Processing, Proceedings
A2 - Bouamor, Houda
A2 - Pino, Juan
A2 - Bali, Kalika
PB - Association for Computational Linguistics (ACL)
Y2 - 6 December 2023 through 10 December 2023
ER -