Utility-biased web crawler: Combining status and topicality

Gautam Pant, Padmini Srinivasan

Research output: Contribution to conference › Paper › peer-review

Abstract

Supporting increasingly sophisticated vertical search and business intelligence applications on the Web remains challenging. A key aspect of the challenge is how the underlying web crawler should harvest and process appropriate information. The focus of these applications, unlike that of general-purpose search engines, is not on exhaustively gathering the information available on the Web but on homing in on the (often small) subset of the Web that offers the highest utility. This utility relies, at least in part, on the topical relevance of the pages gathered (e.g., pages on investing strategies are topically relevant for a vertical search engine serving Wall Street clientele). However, given the large number of web pages addressing even niche topics, these applications must further focus on the most important of the topically relevant pages. In this paper we propose a web crawler designed to support such applications. Our proposed utility-biased web crawler builds its web collection guided by a Cobb-Douglas utility function that incorporates both the topicality of the pages encountered and the global status (importance) of those pages. The latter component of the Cobb-Douglas utility (i.e., page status) is especially hard for a crawler to estimate because global information is unavailable. The utility-biased crawler estimates both topicality and status from local features of the page, using SVM and decision tree algorithms. We find that the status and topicality of web collections present a tradeoff. However, the utility-biased crawler, with appropriate output elasticities of topicality and status, can create web collections that maintain high average topicality while also achieving significantly higher average status than a traditional topical crawler.
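
The abstract does not give the exact functional form used in the paper. As a rough illustration only, the sketch below assumes the standard two-input Cobb-Douglas form, U = T^alpha * S^beta, where T is an estimated topicality score, S is an estimated status score, and alpha and beta are the output elasticities. The names `cobb_douglas_utility` and `Frontier` are hypothetical, and the scores are passed in as plain numbers rather than produced by the SVM and decision tree estimators the abstract mentions.

```python
import heapq


def cobb_douglas_utility(topicality, status, alpha=0.5, beta=0.5):
    """Standard Cobb-Douglas form: U = topicality**alpha * status**beta.

    Assumption: both scores are non-negative estimates; alpha and beta
    are the output elasticities of topicality and status.
    """
    if topicality < 0 or status < 0:
        raise ValueError("scores must be non-negative")
    return (topicality ** alpha) * (status ** beta)


class Frontier:
    """Hypothetical crawl frontier that pops the highest-utility URL first."""

    def __init__(self, alpha=0.5, beta=0.5):
        self.alpha = alpha
        self.beta = beta
        self._heap = []
        self._counter = 0  # tie-breaker so equal utilities pop in insertion order

    def push(self, url, topicality, status):
        utility = cobb_douglas_utility(topicality, status, self.alpha, self.beta)
        # heapq is a min-heap, so store the negated utility.
        heapq.heappush(self._heap, (-utility, self._counter, url))
        self._counter += 1

    def pop(self):
        neg_utility, _, url = heapq.heappop(self._heap)
        return url, -neg_utility


# Illustrative, made-up scores: one page is highly topical but low status,
# the other is moderately topical but high status.
topical_only = Frontier(alpha=1.0, beta=0.0)   # behaves like a purely topical crawler
balanced = Frontier(alpha=0.5, beta=0.5)       # weighs status equally
for frontier in (topical_only, balanced):
    frontier.push("http://example.com/a", topicality=0.9, status=0.1)
    frontier.push("http://example.com/b", topicality=0.5, status=0.9)

first_topical = topical_only.pop()[0]
first_balanced = balanced.pop()[0]
```

With beta set to 0 the ordering depends on topicality alone, so page `a` is fetched first; giving status a nonzero elasticity moves page `b` ahead, which is one way to picture the tradeoff between topicality and status that the abstract describes.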

Original language: English (US)
Pages: 109-114
Number of pages: 6
State: Published - 2008
Externally published: Yes
Event: 2008 Workshop on Information Technologies and Systems, WITS 2008 - Paris, France
Duration: Dec 13 2008 to Dec 14 2008

Conference

Conference: 2008 Workshop on Information Technologies and Systems, WITS 2008
Country/Territory: France
City: Paris
Period: 12/13/08 to 12/14/08

ASJC Scopus subject areas

  • Information Systems
  • Control and Systems Engineering
