TY - JOUR
T1 - The parallel path framework for entity discovery on the Web
AU - Weninger, Tim
AU - Johnston, Thomas J.
AU - Han, Jiawei
PY - 2013
Y1 - 2013
N2 - It has been a dream of the database and Web communities to reconcile the unstructured nature of the World Wide Web with the neat, structured schemas of the database paradigm. Even though databases are currently used to generate Web content in some sites, the schemas of these databases are rarely consistent across a domain. This makes the comparison and aggregation of information from different domains difficult. We aim to make an important step towards resolving this disparity by using the structural and relational information on the Web to (1) extract Web lists, (2) find entity-pages, (3) map entity-pages to a database, and (4) extract attributes of the entities. Specifically, given a Web site and an entity-page (e.g., university department and faculty member home page) we seek to find all of the entity-pages of the same type (e.g., all faculty members in the department), as well as attributes of the specific entities (e.g., their phone numbers, email addresses, office numbers). To do this, we propose a Web structure mining method which grows parallel paths through the Web graph and DOM trees and propagates relevant attribute information forward. We show that by utilizing these parallel paths we can efficiently discover entity-pages and attributes. Finally, we demonstrate the accuracy of our method with a large case study.
AB - It has been a dream of the database and Web communities to reconcile the unstructured nature of the World Wide Web with the neat, structured schemas of the database paradigm. Even though databases are currently used to generate Web content in some sites, the schemas of these databases are rarely consistent across a domain. This makes the comparison and aggregation of information from different domains difficult. We aim to make an important step towards resolving this disparity by using the structural and relational information on the Web to (1) extract Web lists, (2) find entity-pages, (3) map entity-pages to a database, and (4) extract attributes of the entities. Specifically, given a Web site and an entity-page (e.g., university department and faculty member home page) we seek to find all of the entity-pages of the same type (e.g., all faculty members in the department), as well as attributes of the specific entities (e.g., their phone numbers, email addresses, office numbers). To do this, we propose a Web structure mining method which grows parallel paths through the Web graph and DOM trees and propagates relevant attribute information forward. We show that by utilizing these parallel paths we can efficiently discover entity-pages and attributes. Finally, we demonstrate the accuracy of our method with a large case study.
KW - Entity pages
KW - Parallel paths
KW - Semi-structured data
KW - Web structure mining
UR - http://www.scopus.com/inward/record.url?scp=84885653219&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84885653219&partnerID=8YFLogxK
U2 - 10.1145/2516633.2516638
DO - 10.1145/2516633.2516638
M3 - Article
AN - SCOPUS:84885653219
SN - 1559-1131
VL - 7
JO - ACM Transactions on the Web
JF - ACM Transactions on the Web
IS - 3
M1 - 16
ER -