Abstract
In recent years, the Web has been rapidly "deepened" by the prevalence of online databases. On this deep Web, many sources are structured: they provide structured query interfaces and results. Organizing such structured sources into a domain hierarchy is one of the critical steps toward the integration of heterogeneous Web sources. We observe that, for structured Web sources, query schemas (i.e., attributes in query interfaces) are discriminative representatives of the sources and thus can be exploited for source characterization. In particular, by viewing query schemas as a type of categorical data, we abstract the problem of source organization into the clustering of categorical data. Our approach hypothesizes that "homogeneous sources" are characterized by the same hidden generative models for their schemas. To find clusters governed by such statistical distributions, we propose a new objective function, model-differentiation, which employs principled hypothesis testing to maximize statistical heterogeneity among clusters. Our evaluation over hundreds of real sources indicates that (1) the schema-based clustering accurately organizes sources by object domains (e.g., Books, Movies), and (2) on clustering Web query schemas, the model-differentiation function outperforms existing ones, such as likelihood, entropy, and context linkages, when used with the hierarchical agglomerative clustering algorithm.
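As a rough illustration of the idea in the abstract, the sketch below treats each query schema as a bag of attribute names and runs a greedy hierarchical agglomerative clustering that repeatedly merges the two clusters whose attribute distributions are least distinguishable. It is not the paper's implementation. The abstract does not say which hypothesis test is used, so the Pearson chi-square homogeneity statistic here is an assumption, and the names `chi_square`, `cluster_schemas`, and the toy `schemas` are hypothetical. Raw statistics are compared directly, without degrees-of-freedom or significance corrections, which a real implementation would likely need.

```python
from collections import Counter
from itertools import combinations


def chi_square(a: Counter, b: Counter) -> float:
    """Pearson chi-square homogeneity statistic between two attribute-count vectors.

    Assumption: this stands in for the hypothesis test mentioned in the abstract.
    A larger value means the two clusters look less like one shared distribution.
    """
    total_a, total_b = sum(a.values()), sum(b.values())
    grand = total_a + total_b
    stat = 0.0
    for attr in set(a) | set(b):
        col = a[attr] + b[attr]  # Counter returns 0 for missing attributes
        for observed, row_total in ((a[attr], total_a), (b[attr], total_b)):
            expected = row_total * col / grand
            stat += (observed - expected) ** 2 / expected
    return stat


def cluster_schemas(schemas, k):
    """Greedy agglomerative clustering of non-empty schemas down to k clusters."""
    # Each cluster is (member indices, pooled attribute counts).
    clusters = [([i], Counter(s)) for i, s in enumerate(schemas)]
    while len(clusters) > k:
        # Merge the pair that is statistically least different.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda p: chi_square(clusters[p[0]][1], clusters[p[1]][1]),
        )
        merged = (clusters[i][0] + clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return [sorted(members) for members, _ in clusters]


# Hypothetical toy schemas: two book-like and two movie-like query interfaces.
schemas = [
    {"title", "author", "isbn"},
    {"title", "author", "publisher"},
    {"title", "director", "actor"},
    {"director", "actor", "genre"},
]
result = cluster_schemas(schemas, 2)
print(result)
```

On this toy input the two book-like schemas end up in one cluster and the two movie-like schemas in the other, even though "title" appears in both domains.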
Original language | English (US) |
---|---|
Pages | 22-31 |
Number of pages | 10 |
State | Published - 2004 |
Event | CIKM 2004: Proceedings of the Thirteenth ACM Conference on Information and Knowledge Management - Washington, DC, United States; Duration: Nov 8 2004 → Nov 13 2004 |
Other
Other | CIKM 2004: Proceedings of the Thirteenth ACM Conference on Information and Knowledge Management |
---|---|
Country/Territory | United States |
City | Washington, DC |
Period | 11/8/04 → 11/13/04 |
Keywords
- Data integration
- Deep Web
- Hierarchical agglomerative clustering
ASJC Scopus subject areas
- Decision Sciences(all)
- Business, Management and Accounting(all)