Context-aware wrapping: Synchronized data extraction

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The deep Web presents a pressing need for integrating large numbers of dynamically evolving data sources. To be more automatic yet accurate in building an integration system, we observe two problems: First, across sequential tasks in integration, how can a wrapper (as an extraction task) consider the peer sources to facilitate the subsequent matching task? Second, across parallel sources, how can a wrapper leverage the peer wrappers or domain rules to enhance extraction accuracy? These issues, while seemingly unrelated, both boil down to the lack of "context awareness": Current automatic wrapper induction approaches generate a wrapper for one source at a time, in isolation, and thus inherently lack the awareness of the peer sources or domain knowledge in the context of integration. We propose the concept of context-aware wrappers that are amenable to matching and that can leverage peer wrappers or prior domain knowledge. Such context awareness inspires a synchronization framework to construct wrappers consistently and collaboratively across their mutual context. We draw the insight from turbo codes and develop the turbo syncer to interconnect extraction with matching, which together achieve context awareness in wrapping. Our experiments show that the turbo syncer can, on the one hand, enhance extraction consistency and thus increase matching accuracy (from 17-83% to 78-94% in F-measure) and, on the other hand, incorporate peer wrappers and domain knowledge seamlessly to reduce extraction errors (from 09-60% to 01-11%).

Original language: English (US)
Title of host publication: 33rd International Conference on Very Large Data Bases, VLDB 2007 - Conference Proceedings
Editors: Johannes Gehrke, Christoph Koch, Minos Garofalakis, Karl Aberer, Carl-Christian Kanne, Erich J. Neuhold, Venkatesh Ganti, Wolfgang Klas, Chee-Yong Chan, Divesh Srivastava, Dana Florescu, Anand Deshpande
Publisher: Association for Computing Machinery, Inc
Pages: 699-710
Number of pages: 12
ISBN (Electronic): 9781595936493
State: Published - Jan 1 2007
Event: 33rd International Conference on Very Large Data Bases, VLDB 2007 - Vienna, Austria
Duration: Sep 23 2007 to Sep 27 2007

Publication series

Name: 33rd International Conference on Very Large Data Bases, VLDB 2007 - Conference Proceedings

Other

Other: 33rd International Conference on Very Large Data Bases, VLDB 2007
Country: Austria
City: Vienna
Period: 9/23/07 to 9/27/07

ASJC Scopus subject areas

  • Hardware and Architecture
  • Information Systems and Management
  • Information Systems
  • Software
