Visually Descriptive Language Model for Vector Graphics Reasoning

Zhenhailong Wang, Joy Hsu, Xingyao Wang, Kuan-Hao Huang, Manling Li, Jiajun Wu, Heng Ji

Research output: Contribution to journal › Article › peer-review

Abstract

Despite significant advancements, current large multimodal models (LMMs) struggle to bridge the gap between low-level visual perception, which concerns shapes, sizes, and layouts, and high-level language reasoning, which involves semantics, events, and logic. This limitation becomes evident in tasks requiring precise visual perception, such as comparing geometric properties or solving visual algorithmic reasoning problems. To study this failure mode, we focus on an important visual domain: vector graphics, i.e., images composed purely of 2D objects and shapes, which are prevalent in web and mobile environments. Importantly, we consider rasterized vector graphics without assuming access to their underlying vector code. We identify two key research questions: how can we enable precise visual perception, and how can we facilitate high-level reasoning based on such low-level perceptions? To accurately capture low-level visual details, we explore using Scalable Vector Graphics (SVG) to precisely encode visual scenes. However, SVG code is not readily interpretable by large language models (LLMs) or LMMs in a zero-shot manner. To address this challenge, we propose the Visually Descriptive Language Model (VDLM), which builds a bridge between low-level visual perception and high-level language reasoning. VDLM learns an intermediate symbolic representation, the Primal Visual Description (PVD), that abstracts raw SVG code into primitive shapes and their attributes. This abstraction can be interpreted directly by foundation models, enabling zero-shot generalization to different reasoning tasks. As an initial step toward a descriptive intermediate representation for low-level visual reasoning, the SVG-to-PVD model is currently limited to simple compositions of primitive shapes, for which synthetic data can be generated without human annotation. Nevertheless, empirical experiments show that VDLM significantly improves state-of-the-art LMMs, such as GPT-4o, across various low-level multimodal perception and reasoning tasks on rasterized vector graphics. Additionally, we provide extensive analyses of VDLM’s performance, showing that our framework offers improved interpretability due to its disentangled perception and reasoning processes. We also conduct an in-depth error analysis, highlighting remaining limitations and suggesting directions for future research.
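
To make the pipeline concrete, the sketch below illustrates the general idea of a PVD-style abstraction: a scene is reduced to a list of primitive shapes with explicit attributes, which can then be serialized into text for an off-the-shelf LLM to reason over. This is a minimal, hypothetical sketch; the Circle/Rectangle records, field names, and the pvd_to_prompt helper are illustrative assumptions, not the paper's actual PVD schema or code.

```python
from dataclasses import dataclass, asdict
import json

# Hypothetical primitive records in the spirit of a Primal Visual Description:
# each shape is reduced to a type plus explicit geometric attributes.
# Field names are illustrative, not the paper's exact schema.

@dataclass
class Circle:
    center: tuple   # (x, y) in pixel coordinates
    radius: float
    color: str

@dataclass
class Rectangle:
    top_left: tuple  # (x, y) of the top-left corner
    width: float
    height: float
    color: str

def pvd_to_prompt(shapes) -> str:
    """Serialize primitive records into plain text that a foundation model
    can reason over zero-shot (e.g., "which circle is larger?")."""
    records = [{"type": type(s).__name__.lower(), **asdict(s)} for s in shapes]
    return "Scene primitives:\n" + json.dumps(records, indent=2)

scene = [
    Circle(center=(40, 40), radius=15.0, color="blue"),
    Circle(center=(120, 40), radius=25.0, color="red"),
    Rectangle(top_left=(60, 90), width=50.0, height=30.0, color="green"),
]
print(pvd_to_prompt(scene))
```

The point of such an abstraction is that downstream questions about geometric properties are answered from explicit attributes (e.g., comparing the two radius values) rather than from raw pixels; in the paper, a learned SVG-to-PVD model produces the primitive descriptions.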

Original language: English (US)
Journal: Transactions on Machine Learning Research
Volume: 2025
State: Published - 2025

ASJC Scopus subject areas

  • Artificial Intelligence
  • Computer Vision and Pattern Recognition
