TY - GEN
T1 - Context-aware wrapping
T2 - 33rd International Conference on Very Large Data Bases, VLDB 2007
AU - Chuang, Shui Lung
AU - Chang, Kevin Chen Chuan
AU - Zhai, Cheng Xiang
N1 - Funding Information:
This material is based upon the work partially supported by NSF Grants IIS-0133199, IIS-0313260, the 2004 and 2005 IBM Faculty Awards. Any opinions, findings, and conclusions or recommendations expressed in this publication are those of the author(s) and do not necessarily reflect the views of the funding agencies.
Publisher Copyright:
Copyright 2007 VLDB Endowment, ACM.
PY - 2007
Y1 - 2007
N2 - The deep Web presents a pressing need for integrating large numbers of dynamically evolving data sources. To be more automatic yet accurate in building an integration system, we observe two problems: First, across sequential tasks in integration, how can a wrapper (as an extraction task) consider the peer sources to facilitate the subsequent matching task? Second, across parallel sources, how can a wrapper leverage the peer wrappers or domain rules to enhance extraction accuracy? These issues, while seemingly unrelated, both boil down to the lack of "context awareness": Current automatic wrapper induction approaches generate a wrapper for one source at a time, in isolation, and thus inherently lack the awareness of the peer sources or domain knowledge in the context of integration. We propose the concept of context-aware wrappers that are amenable to matching and that can leverage peer wrappers or prior domain knowledge. Such context awareness inspires a synchronization framework to construct wrappers consistently and collaboratively across their mutual context. We draw the insight from turbo codes and develop the turbo syncer to interconnect extraction with matching, which together achieve context awareness in wrapping. Our experiments show that the turbo syncer can, on the one hand, enhance extraction consistency and thus increase matching accuracy (from 17-83% to 78-94% in F-measure) and, on the other hand, incorporate peer wrappers and domain knowledge seamlessly to reduce extraction errors (from 09-60% to 01-11%).
UR - http://www.scopus.com/inward/record.url?scp=85011016482&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85011016482&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85011016482
T3 - 33rd International Conference on Very Large Data Bases, VLDB 2007 - Conference Proceedings
SP - 699
EP - 710
BT - 33rd International Conference on Very Large Data Bases, VLDB 2007 - Conference Proceedings
A2 - Gehrke, Johannes
A2 - Koch, Christoph
A2 - Garofalakis, Minos
A2 - Aberer, Karl
A2 - Kanne, Carl-Christian
A2 - Neuhold, Erich J.
A2 - Ganti, Venkatesh
A2 - Klas, Wolfgang
A2 - Chan, Chee-Yong
A2 - Srivastava, Divesh
A2 - Florescu, Dana
A2 - Deshpande, Anand
PB - Association for Computing Machinery, Inc
Y2 - 23 September 2007 through 27 September 2007
ER -