TY - JOUR
T1 - Visually Descriptive Language Model for Vector Graphics Reasoning
AU - Wang, Zhenhailong
AU - Hsu, Joy
AU - Wang, Xingyao
AU - Huang, Kuan-Hao
AU - Li, Manling
AU - Wu, Jiajun
AU - Ji, Heng
N1 - This research is based upon work supported by U.S. DARPA ECOLE Program No. HR00112390060, AFOSR YIP FA9550-23-1-0127, ONR N00014-23-1-2355, ONR YIP N00014-24-1-2117, ONR MURI N00014-24-1-2748, and the Stanford Institute for Human-Centered AI (HAI). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of DARPA or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright annotation therein.
PY - 2025
Y1 - 2025
AB - Despite significant advancements, current large multimodal models (LMMs) struggle to bridge the gap between low-level visual perception—focusing on shapes, sizes, and layouts—and high-level language reasoning involving semantics, events, and logic. This limitation becomes evident in tasks requiring precise visual perception, such as comparing geometric properties or solving visual algorithmic reasoning problems. To study this failure mode, we focus on an important visual domain: vector graphics—images composed purely of 2D objects and shapes, which are prevalent in web and mobile environments. Importantly, we consider rasterized vector graphics without assuming access to their underlying vector code. We identify two key research questions: how can we enable precise visual perception, and how can we facilitate high-level reasoning based on such low-level perceptions? To accurately capture low-level visual details, we explore using SVG for the precise encoding of visual scenes. However, SVGs are not readily interpretable by LLMs or LMMs in a zero-shot manner. To address this challenge, we propose the Visually Descriptive Language Model (VDLM) to build a bridge between low-level visual perception and high-level language reasoning. VDLM learns an intermediate symbolic representation called Primal Visual Description (PVD), which translates raw SVGs into a higher-level abstraction comprising primitive attributes. This abstraction allows for direct interpretation by foundation models for zero-shot generalization to different reasoning tasks. As an initial step to construct a descriptive intermediate representation for low-level visual reasoning, the SVG-to-PVD model is currently limited to simple compositions of primitive shapes, for which synthetic data can be generated without human annotation. Nevertheless, empirical experiments show that VDLM leads to significant improvements in state-of-the-art LMMs, such as GPT-4o, across various low-level multimodal perception and reasoning tasks on rasterized vector graphics. Additionally, we provide extensive analyses of VDLM's performance, showing that our framework offers improved interpretability due to its disentangled perception and reasoning processes. We also conduct an in-depth error analysis, highlighting remaining limitations and suggesting directions for future research.
UR - https://www.scopus.com/pages/publications/105007360773
UR - https://www.scopus.com/inward/citedby.url?scp=105007360773&partnerID=8YFLogxK
M3 - Article
AN - SCOPUS:105007360773
SN - 2835-8856
VL - 2025
JO - Transactions on Machine Learning Research
JF - Transactions on Machine Learning Research
ER -