Status locality on the web: Implications for building focused collections

Gautam Pant, Padmini Srinivasan

Research output: Contribution to journal › Article › peer-review

Abstract

Topical locality on the Web is the notion that pages tend to link to other topically similar pages and that such similarity decays rapidly with link distance. This supports meaningful Web browsing and searching by information consumers. It also allows topical Web crawlers, programs that fetch pages by following hyperlinks, to harvest topical subsets of the Web for applications such as those in vertical search and business intelligence. We show that the Web exhibits another property that we call "status locality." It is based on the notion that pages tend to link to other pages of similar status (importance) and that this status similarity also decays rapidly with link distance. Analogous to topical locality, status locality may also be exploited by Web crawlers. Collections built by such crawlers include pages that are both topically relevant and also important. This capability is crucial because of the large numbers of Web pages addressing even niche topics. The challenge in exploiting status locality while crawling is that page importance (or status) is typically recognized through global measures computed by processing link data from billions of pages. In contrast, topical Web crawlers depend on local information based on previously downloaded pages. We solve this problem by using methods developed previously that utilize local characteristics of pages to estimate their global status. This leads to the design of new crawlers, specifically of utility-biased crawlers guided by a Cobb-Douglas utility function. Our crawler experiments show that status and topicality of Web collections present a trade-off. An adaptive version of our utility-biased crawler dynamically modifies output elasticities of topicality and status to create Web collections that maintain high average topicality. This can be done while simultaneously achieving significantly higher average status as compared to several benchmarks including a state-of-the-art topical crawler.
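
The abstract does not give the exact form of the utility function, so the following is only a minimal sketch of how a utility-biased crawl frontier might rank URLs, assuming the textbook Cobb-Douglas form U = topicality^alpha * status^beta. The names `cobb_douglas_utility`, `Frontier`, `alpha`, and `beta` are hypothetical, and the topicality and status scores are assumed to be locally estimated values in [0, 1]; none of this is taken from the paper itself.

```python
import heapq


def cobb_douglas_utility(topicality, status, alpha=0.5, beta=0.5):
    """Textbook Cobb-Douglas form: U = topicality**alpha * status**beta.

    alpha and beta play the role of output elasticities. Their default
    values here are illustrative assumptions, not values from the paper.
    """
    if topicality < 0 or status < 0:
        raise ValueError("scores must be non-negative")
    return (topicality ** alpha) * (status ** beta)


class Frontier:
    """Hypothetical crawl frontier: a max-priority queue of unvisited URLs."""

    def __init__(self):
        self._heap = []      # entries are (-utility, url); heapq is a min-heap
        self._seen = set()   # URLs already queued, to avoid duplicates

    def push(self, url, topicality, status, alpha=0.5, beta=0.5):
        """Queue a URL with its estimated utility, ignoring repeats."""
        if url in self._seen:
            return
        self._seen.add(url)
        utility = cobb_douglas_utility(topicality, status, alpha, beta)
        heapq.heappush(self._heap, (-utility, url))

    def pop(self):
        """Return the (url, utility) pair with the highest utility."""
        neg_utility, url = heapq.heappop(self._heap)
        return url, -neg_utility

    def __len__(self):
        return len(self._heap)


# Example with made-up scores: a balanced page outranks pages that are
# strong on only one dimension when alpha == beta.
frontier = Frontier()
frontier.push("http://example.org/a", topicality=0.9, status=0.1)
frontier.push("http://example.org/b", topicality=0.5, status=0.5)
frontier.push("http://example.org/c", topicality=0.2, status=0.9)
best_url, best_utility = frontier.pop()
```

Setting `alpha=1.0, beta=0.0` reduces this sketch to a purely topical ranking. An adaptive variant like the one the abstract describes would adjust the two exponents during the crawl, but the update rule is not stated in the abstract and is not modeled here.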

Original language: English (US)
Pages (from-to): 802-821
Number of pages: 20
Journal: Information Systems Research
Volume: 24
Issue number: 3
State: Published - 2013
Externally published: Yes

Keywords

  • Homophily
  • Predictive models
  • Status locality
  • Topical crawlers

ASJC Scopus subject areas

  • Management Information Systems
  • Information Systems
  • Computer Networks and Communications
  • Information Systems and Management
  • Library and Information Sciences
