Understanding web query interfaces: Best-effort parsing with hidden syntax

Zhen Zhang, Bin He, Kevin Chen-Chuan Chang

Research output: Contribution to journal › Conference article › peer-review


Recently, the Web has been rapidly "deepened" by many searchable databases online, where data are hidden behind query forms. For modelling and integrating Web databases, the very first challenge is to understand what a query interface says, or what query capabilities a source supports. Such automatic extraction of interface semantics is challenging, as query forms are created autonomously. Our approach builds on the observation that, across myriad sources, query forms seem to reveal some "concerted structure," by sharing common building blocks. Toward this insight, we hypothesize the existence of a hidden syntax that guides the creation of query interfaces, albeit from different sources. This hypothesis effectively transforms query interfaces into a visual language with a non-prescribed grammar and, thus, their semantic understanding a parsing problem. Such a paradigm enables principled solutions for both declaratively representing common patterns, by a derived grammar, and systematically interpreting query forms, by a global parsing mechanism. To realize this paradigm, we must address the challenges of a hypothetical syntax: that it is to be derived, and that it is secondary to the input. At the heart of our form extractor, we thus develop a 2P grammar and a best-effort parser, which together realize a parsing mechanism for a hypothetical syntax. Our experiments show the promise of this approach: it achieves above 85% accuracy for extracting query conditions across random sources.

Original language: English (US)
Pages (from-to): 107-118
Number of pages: 12
Journal: Proceedings of the ACM SIGMOD International Conference on Management of Data
State: Published - Jul 27 2004
Event: Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2004 - Paris, France
Duration: Jun 13 2004 → Jun 18 2004

ASJC Scopus subject areas

  • Software
  • Information Systems

