TY - GEN
T1 - Exploring structure and content on the web
T2 - 6th ACM International Conference on Web Search and Data Mining, WSDM 2013
AU - Weninger, Tim
AU - Han, Jiawei
PY - 2013
Y1 - 2013
N2 - In this tutorial we view the World Wide Web as a type of massive, decentralized database. At present, this "Web database" is presented in a manner largely devoid of any consistent meaning or schema. That is not to say that Web data lacks an underlying organization; in fact, most Web content is generated from an underlying schema-bound or otherwise structured database. Information extraction is generally concerned with the reconciliation of unstructured or semi-structured Web content with the neatly structured database paradigm. With this Web database in hand, researchers and practitioners have recently begun developing mechanisms that return structured results in response to an unstructured query. These new developments are a product of (1) record, list, and table extraction from large numbers of semi-structured Web pages, (2) integration of these disparate extraction results into a consistent form, and (3) analysis of the newly extracted and integrated Web data. Among the many fruits of this line of work is the ability for semi-structured Web data to enhance the search capabilities of a schema-bound database. Alternatively, structured database records have also been used to augment Web page collections typically used by Web search engines. We will cover several key technologies and principles explored so far in the area of Web information extraction, search, and exploration.
AB - In this tutorial we view the World Wide Web as a type of massive, decentralized database. At present, this "Web database" is presented in a manner largely devoid of any consistent meaning or schema. That is not to say that Web data lacks an underlying organization; in fact, most Web content is generated from an underlying schema-bound or otherwise structured database. Information extraction is generally concerned with the reconciliation of unstructured or semi-structured Web content with the neatly structured database paradigm. With this Web database in hand, researchers and practitioners have recently begun developing mechanisms that return structured results in response to an unstructured query. These new developments are a product of (1) record, list, and table extraction from large numbers of semi-structured Web pages, (2) integration of these disparate extraction results into a consistent form, and (3) analysis of the newly extracted and integrated Web data. Among the many fruits of this line of work is the ability for semi-structured Web data to enhance the search capabilities of a schema-bound database. Alternatively, structured database records have also been used to augment Web page collections typically used by Web search engines. We will cover several key technologies and principles explored so far in the area of Web information extraction, search, and exploration.
KW - information extraction
KW - information integration
KW - semi-structured data
UR - http://www.scopus.com/inward/record.url?scp=84874231558&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84874231558&partnerID=8YFLogxK
U2 - 10.1145/2433396.2433499
DO - 10.1145/2433396.2433499
M3 - Conference contribution
AN - SCOPUS:84874231558
SN - 9781450318693
T3 - WSDM 2013 - Proceedings of the 6th ACM International Conference on Web Search and Data Mining
SP - 779
EP - 780
BT - WSDM 2013 - Proceedings of the 6th ACM International Conference on Web Search and Data Mining
Y2 - 4 February 2013 through 8 February 2013
ER -