TY - GEN
T1 - Extracting general lists from web documents
T2 - 24th International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2011
AU - Fumarola, Fabio
AU - Weninger, Tim
AU - Barber, Rick
AU - Malerba, Donato
AU - Han, Jiawei
PY - 2011
Y1 - 2011
N2 - The problem of extracting structured data (e.g. lists, record sets, tables) from the Web has traditionally been approached by taking into account either the underlying markup structure of a Web page or its visual structure. However, empirical results show that considering the HTML structure and the visual cues of a Web page independently does not generalize well. We propose a new hybrid method to extract general lists from the Web. It employs both general assumptions on the visual rendering of lists and the structural representation of the items contained in them. We show that our method significantly outperforms existing methods across a varied Web corpus.
AB - The problem of extracting structured data (e.g. lists, record sets, tables) from the Web has traditionally been approached by taking into account either the underlying markup structure of a Web page or its visual structure. However, empirical results show that considering the HTML structure and the visual cues of a Web page independently does not generalize well. We propose a new hybrid method to extract general lists from the Web. It employs both general assumptions on the visual rendering of lists and the structural representation of the items contained in them. We show that our method significantly outperforms existing methods across a varied Web corpus.
KW - Web information integration
KW - Web lists
KW - Web mining
UR - http://www.scopus.com/inward/record.url?scp=79960507022&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79960507022&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-21822-4_29
DO - 10.1007/978-3-642-21822-4_29
M3 - Conference contribution
AN - SCOPUS:79960507022
SN - 9783642218217
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 285
EP - 294
BT - Modern Approaches in Applied Intelligence - 24th International Conference on Industrial Engineering and Other Applications of Applied Intelligent Systems, IEA/AIE 2011, Proceedings
Y2 - 28 June 2011 through 1 July 2011
ER -