Building enriched web page representations using link paths

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Anchor text has a long history of enriching documents for a variety of tasks on the World Wide Web. Anchor texts are useful because they resemble typical Web queries and because they express the document's context. It is therefore common practice for Web search engines to incorporate incoming anchor text into a document's standard textual representation. However, this approach does not suffice for documents with very few inlinks, and it does not capture the document's full context. To remedy these problems, we employ link paths, which contain the anchor texts along paths through the Web that end at the document in question. We propose and study several ways to aggregate anchor text from link paths, and we show that the information from link paths can be used to (1) improve known-item search in site-specific search, and (2) map Web pages to database records. We rigorously evaluate our proposed approach on several real-world test collections. We find that our approach significantly improves performance over baseline and existing techniques in both tasks.
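
The idea of a link path can be illustrated with a minimal sketch. It is not the authors' implementation: the triple-based link representation, the function names `link_paths` and `aggregate`, the `max_hops` limit, and the distance-decay weighting are all assumptions chosen for illustration, and the abstract does not say which aggregation schemes the paper actually studies.

```python
from collections import Counter, defaultdict


def link_paths(links, target, max_hops=2):
    """Enumerate anchor-text sequences along link paths that end at `target`.

    `links` is an iterable of (source_url, target_url, anchor_text) triples
    (a hypothetical representation). Each returned path lists anchor texts
    from the farthest link down to the link pointing directly at `target`.
    """
    inlinks = defaultdict(list)
    for src, dst, text in links:
        inlinks[dst].append((src, text))

    paths = []

    def walk(page, anchors, visited):
        for src, text in inlinks.get(page, []):
            if src in visited:
                continue  # do not follow cycles
            path = [text] + anchors
            paths.append(path)
            if len(path) < max_hops:
                walk(src, path, visited | {src})

    walk(target, [], {target})
    return paths


def aggregate(paths, decay=0.5):
    """Weight each term by decay**distance, where distance 0 is the link
    that points directly at the target (an assumed weighting scheme)."""
    weights = Counter()
    for path in paths:
        for distance, text in enumerate(reversed(path)):
            for term in text.lower().split():
                weights[term] += decay ** distance
    return weights


# Toy site: the home page links to a faculty listing, which links to a profile.
links = [
    ("/", "/faculty", "Faculty"),
    ("/faculty", "/faculty/smith", "Jane Smith"),
    ("/news", "/faculty/smith", "Professor Smith"),
]
paths = link_paths(links, "/faculty/smith", max_hops=2)
weights = aggregate(paths)
```

In this toy example the profile page picks up the term "faculty" from a link two hops away, context that its direct inlinks alone do not provide. Because every path of every length is counted, the anchor text of a direct inlink contributes once for each path that passes through it; a different aggregation could count it only once.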

Original language: English (US)
Title of host publication: HT'12 - Proceedings of 23rd ACM Conference on Hypertext and Social Media
Pages: 53-62
Number of pages: 10
State: Published - Jul 25 2012
Event: 23rd ACM Conference on Hypertext and Social Media, HT'12 - Milwaukee, WI, United States
Duration: Jun 25 2012 → Jun 28 2012

Publication series

Name: HT'12 - Proceedings of 23rd ACM Conference on Hypertext and Social Media

Other

Other: 23rd ACM Conference on Hypertext and Social Media, HT'12
Country: United States
City: Milwaukee, WI
Period: 6/25/12 → 6/28/12

Keywords

  • Anchor text
  • Document indexing
  • Link paths
  • Record linkage
  • Web

ASJC Scopus subject areas

  • Artificial Intelligence
  • Human-Computer Interaction
  • Software
