Organizing structured web sources by query schemas: A clustering approach

Bin He, Tao Tao, Kevin Chen-Chuan Chang

Research output: Contribution to conference › Paper

Abstract

In recent years, the Web has been rapidly "deepened" by the prevalence of online databases. On this deep Web, many sources are structured: they provide structured query interfaces and results. Organizing such structured sources into a domain hierarchy is one of the critical steps toward the integration of heterogeneous Web sources. We observe that, for structured Web sources, query schemas (i.e., attributes in query interfaces) are discriminative representatives of the sources and thus can be exploited for source characterization. In particular, by viewing query schemas as a type of categorical data, we abstract the problem of source organization into the clustering of categorical data. Our approach hypothesizes that "homogeneous sources" are characterized by the same hidden generative models for their schemas. To find clusters governed by such statistical distributions, we propose a new objective function, model-differentiation, which employs principled hypothesis testing to maximize statistical heterogeneity among clusters. Our evaluation over hundreds of real sources indicates that (1) the schema-based clustering accurately organizes sources by object domains (e.g., Books, Movies), and (2) on clustering Web query schemas, the model-differentiation function outperforms existing ones, such as likelihood, entropy, and context linkages, with the hierarchical agglomerative clustering algorithm.
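The idea of clustering query schemas as categorical data can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes each schema is a set of attribute names, uses the Pearson chi-square statistic for homogeneity as a stand-in for the hypothesis test, and merges the two clusters whose attribute distributions are least distinguishable until a requested number of clusters remains. The function names (`attribute_counts`, `chi_square`, `cluster_schemas`) and the toy schemas are hypothetical.

```python
from collections import Counter
from itertools import combinations


def attribute_counts(schemas):
    """Count how often each attribute occurs across the schemas of one cluster."""
    return Counter(attr for schema in schemas for attr in schema)


def chi_square(counts_a, counts_b):
    """Pearson chi-square statistic for homogeneity of two attribute count vectors.

    A small value means the two clusters could plausibly share one
    attribute distribution; a large value means they look different.
    """
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    grand_total = total_a + total_b
    stat = 0.0
    for attr in set(counts_a) | set(counts_b):
        column_total = counts_a[attr] + counts_b[attr]
        for observed, row_total in ((counts_a[attr], total_a),
                                    (counts_b[attr], total_b)):
            expected = row_total * column_total / grand_total
            stat += (observed - expected) ** 2 / expected
    return stat


def cluster_schemas(schemas, num_clusters):
    """Agglomerative clustering: repeatedly merge the least distinguishable pair.

    Assumption: raw statistics are compared directly. A more careful version
    would compare p-values, since degrees of freedom differ between pairs.
    """
    clusters = [[frozenset(schema)] for schema in schemas]
    while len(clusters) > num_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: chi_square(attribute_counts(clusters[pair[0]]),
                                        attribute_counts(clusters[pair[1]])),
        )
        clusters[i] = clusters[i] + clusters[j]  # i < j, so deleting j is safe
        del clusters[j]
    return clusters


# Toy query schemas (hypothetical): two book sources and two movie sources.
sources = [
    {"title", "author", "isbn"},
    {"title", "author", "publisher"},
    {"title", "director", "actor"},
    {"title", "director", "genre"},
]

for cluster in cluster_schemas(sources, num_clusters=2):
    print([sorted(schema) for schema in cluster])
```

On this toy input the book-like and movie-like schemas end up in separate clusters, because each pair shares most of its attributes while the two domains share only `title`. Stopping at a fixed cluster count is a simplification chosen for brevity; a test-driven variant could instead stop once every remaining pair is judged significantly different.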

Original language: English (US)
Pages: 22-31
Number of pages: 10
State: Published - Dec 1 2004
Event: CIKM 2004: Proceedings of the Thirteenth ACM Conference on Information and Knowledge Management - Washington, DC, United States
Duration: Nov 8 2004 to Nov 13 2004

Other

Other: CIKM 2004: Proceedings of the Thirteenth ACM Conference on Information and Knowledge Management
Country: United States
City: Washington, DC
Period: 11/8/04 to 11/13/04

Keywords

  • Data integration
  • Deep Web
  • Hierarchical agglomerative clustering

ASJC Scopus subject areas

  • Decision Sciences (all)
  • Business, Management and Accounting (all)

  • Cite this

    He, B., Tao, T., & Chang, K. C-C. (2004). Organizing structured web sources by query schemas: A clustering approach. 22-31. Paper presented at CIKM 2004: Proceedings of the Thirteenth ACM Conference on Information and Knowledge Management, Washington, DC, United States.