Extracting general lists from web documents: A hybrid approach

Fabio Fumarola, Tim Weninger, Rick Barber, Donato Malerba, Jiawei Han

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The problem of extracting structured data (e.g., lists, record sets, and tables) from the Web has traditionally been approached by taking into account either the underlying markup structure of a Web page or its visual structure. However, empirical results show that considering the HTML structure and the visual cues of a Web page independently does not generalize well. We propose a new hybrid method to extract general lists from the Web. It employs both general assumptions on the visual rendering of lists and the structural representation of the items contained in them. We show that our method significantly outperforms existing methods across a varied Web corpus.

Original language: English (US)
Title of host publication: Modern Approaches in Applied Intelligence - 24th International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2011, Proceedings
Pages: 285-294
Number of pages: 10
Edition: PART 1
State: Published - 2011
Event: 24th International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2011 - Syracuse, NY, United States
Duration: Jun 28 2011 → Jul 1 2011

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Number: PART 1
Volume: 6703 LNAI
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 24th International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2011
Country/Territory: United States
City: Syracuse, NY
Period: 6/28/11 → 7/1/11

Keywords

  • Web information integration
  • Web lists
  • Web mining

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)
