TY - GEN
T1 - Integrating web query results
T2 - 17th ACM Conference on Information and Knowledge Management, CIKM'08
AU - Chuang, Shui Lung
AU - Chang, Kevin Chen Chuan
PY - 2008
Y1 - 2008
N2 - The emergence of numerous data sources online has presented a pressing need for more automatic yet accurate data integration techniques. For the data returned from querying such sources, most works focus on how to extract the embedded structured data more accurately. However, to eventually provide an integrated access to these query results, a last but not least step is to combine the extracted data coming from different sources. A critical task is finding the correspondence of the data fields between the sourcesa problem well known as schema matching. Query results are a small and biased sample set of instances obtained from sources; the obtained schema information is thus very implicit and incomplete, which often prevents existing schema matching approaches from performing effectively. In this paper, we develop a novel framework for understanding and effectively supporting schema matching on such instancebased data, especially for integrating multiple sources. We view discovering matching as constructing a more complete domain schema that best describes the input data. With this conceptual view, we can leverage various data instances and observed regularities seamlessly with holistic, multiplesource schema matching to achieve more accurate matching results. Our experiments show that our framework consistently outperforms baseline pairwise and clustering-based approaches (raising F-measure from 50-89% to 89-94%) and works uniformly well for the surveyed domains.
AB - The emergence of numerous data sources online has presented a pressing need for more automatic yet accurate data integration techniques. For the data returned from querying such sources, most works focus on how to extract the embedded structured data more accurately. However, to eventually provide an integrated access to these query results, a last but not least step is to combine the extracted data coming from different sources. A critical task is finding the correspondence of the data fields between the sourcesa problem well known as schema matching. Query results are a small and biased sample set of instances obtained from sources; the obtained schema information is thus very implicit and incomplete, which often prevents existing schema matching approaches from performing effectively. In this paper, we develop a novel framework for understanding and effectively supporting schema matching on such instancebased data, especially for integrating multiple sources. We view discovering matching as constructing a more complete domain schema that best describes the input data. With this conceptual view, we can leverage various data instances and observed regularities seamlessly with holistic, multiplesource schema matching to achieve more accurate matching results. Our experiments show that our framework consistently outperforms baseline pairwise and clustering-based approaches (raising F-measure from 50-89% to 89-94%) and works uniformly well for the surveyed domains.
UR - http://www.scopus.com/inward/record.url?scp=70349230065&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=70349230065&partnerID=8YFLogxK
U2 - 10.1145/1458082.1458090
DO - 10.1145/1458082.1458090
M3 - Conference contribution
AN - SCOPUS:70349230065
SN - 9781595939913
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 33
EP - 42
BT - Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM'08
Y2 - 26 October 2008 through 30 October 2008
ER -