Statistical Schema Matching across Web Query Interfaces

Research output: Contribution to journalConference articlepeer-review

Abstract

Schema matching is a critical problem for integrating heterogeneous information sources. Traditionally, the problem of matching multiple schemas has essentially relied on finding pairwise-attribute correspondence. This paper proposes a different approach, motivated by integrating large numbers of data sources on the Internet. On this "deep Web," we observe two distinguishing characteristics that offer a new view for considering schema matching: First, as the Web scales, there are ample sources that provide structured information in the same domains (e.g., books and automobiles). Second, while sources proliferate, their aggregate schema vocabulary tends to converge at a relatively small size. Motivated by these observations, we propose a new paradigm, statistical schema matching: Unlike traditional approaches using pairwise-attribute correspondence, we take a holistic approach to match all input schemas by finding an underlying generative schema model. We propose a general statistical framework MGS for such hidden model discovery, which consists of hypothesis modeling, generation, and selection. Further, we specialize the general framework to develop Algorithm MGSsd, targeting at synonym discovery, a canonical problem of schema matching, by designing and discovering a model that specifically captures synonym attributes. We demonstrate our approach over hundreds of real Web sources in four domains and the results show good accuracy.

Original languageEnglish (US)
Pages (from-to)217-228
Number of pages12
JournalProceedings of the ACM SIGMOD International Conference on Management of Data
DOIs
StatePublished - 2003
Event2003 ACM SIGMOD International Conference on Management of Data - San Diego, CA, United States
Duration: Jun 9 2003Jun 12 2003

ASJC Scopus subject areas

  • Software
  • Information Systems

Fingerprint

Dive into the research topics of 'Statistical Schema Matching across Web Query Interfaces'. Together they form a unique fingerprint.

Cite this