Abstract
Recently, the Web has been rapidly "deepened" by many searchable databases online, where data are hidden behind query forms. For modelling and integrating Web databases, the very first challenge is to understand what a query interface says, or what query capabilities a source supports. Such automatic extraction of interface semantics is challenging, as query forms are created autonomously. Our approach builds on the observation that, across myriad sources, query forms seem to reveal some "concerted structure," by sharing common building blocks. Toward this insight, we hypothesize the existence of a hidden syntax that guides the creation of query interfaces, albeit from different sources. This hypothesis effectively transforms query interfaces into a visual language with a non-prescribed grammar and, thus, their semantic understanding into a parsing problem. Such a paradigm enables principled solutions for both declaratively representing common patterns, by a derived grammar, and systematically interpreting query forms, by a global parsing mechanism. To realize this paradigm, we must address the challenges of a hypothetical syntax: that it is to be derived, and that it is secondary to the input. At the heart of our form extractor, we thus develop a 2P grammar and a best-effort parser, which together realize a parsing mechanism for a hypothetical syntax. Our experiments show the promise of this approach: it achieves above 85% accuracy for extracting query conditions across random sources.
Original language | English (US) |
---|---|
Pages (from-to) | 107-118 |
Number of pages | 12 |
Journal | Proceedings of the ACM SIGMOD International Conference on Management of Data |
State | Published - 2004 |
Event | Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2004 - Paris, France. Duration: Jun 13 2004 → Jun 18 2004 |
ASJC Scopus subject areas
- Software
- Information Systems