Exploring structure and content on the web: Extraction and integration of the semi-structured web

Tim Weninger, Jiawei Han

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

In this tutorial we view the World Wide Web as a type of massive, decentralized database. At present, this "Web database" is presented in a manner largely devoid of any consistent meaning or schema. That is not to say that Web data lacks an underlying organization; in fact, most Web content is generated from an underlying schema-bound or otherwise structured database. Information extraction is generally concerned with reconciling unstructured or semi-structured Web content with the neatly structured database paradigm. With this Web database in hand, researchers and practitioners have recently begun developing mechanisms that return structured results in response to an unstructured query. These new developments are a product of (1) record, list, and table extraction from large numbers of semi-structured Web pages, (2) integration of these disparate extraction results into a consistent form, and (3) analysis of the newly extracted and integrated Web data. One result of this line of work is that semi-structured Web data can enhance the search capabilities of a schema-bound database. Conversely, structured database records have been used to augment the Web page collections typically used by Web search engines. We will cover several key technologies and principles explored so far in the area of Web information extraction, search, and exploration.
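The three stages named in the abstract (extraction, integration, analysis) can be illustrated with a minimal sketch. This is not the authors' method: the class and function names (`TableExtractor`, `extract_records`, `integrate`), the two sample pages, and the field mappings are all hypothetical, and the sketch assumes simple, well-formed HTML tables whose first row is a header.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of every <tr> in a page as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []      # completed rows, each a list of cell strings
        self._row = None    # row currently being built, or None
        self._cell = None   # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_records(html):
    """Stage 1 (extraction): treat the first row as a header, the rest as records."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


def integrate(records, field_map):
    """Stage 2 (integration): rename source-specific fields to a shared schema."""
    return [
        {field_map[key]: value for key, value in record.items() if key in field_map}
        for record in records
    ]


# Two invented pages that describe the same kind of entity with different headers.
PAGE_A = ("<table><tr><th>Title</th><th>Year</th></tr>"
          "<tr><td>Paper X</td><td>2013</td></tr></table>")
PAGE_B = ("<table><tr><th>Name</th><th>Published</th></tr>"
          "<tr><td>Paper Y</td><td>2012</td></tr></table>")

merged = (
    integrate(extract_records(PAGE_A), {"Title": "title", "Year": "year"})
    + integrate(extract_records(PAGE_B), {"Name": "title", "Published": "year"})
)

# Stage 3 (analysis): a trivial query over the integrated records.
by_year = sorted(merged, key=lambda record: record["year"])
```

The hand-written field maps stand in for the schema-matching step; real pages would need far more robust extraction and matching than this sketch attempts.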

Original language: English (US)
Title of host publication: WSDM 2013 - Proceedings of the 6th ACM International Conference on Web Search and Data Mining
Pages: 779-780
Number of pages: 2
State: Published - 2013
Event: 6th ACM International Conference on Web Search and Data Mining, WSDM 2013 - Rome, Italy
Duration: Feb 4 2013 to Feb 8 2013

Keywords

  • information extraction
  • information integration
  • semi-structured data

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
