Learning to crawl: Comparing classification schemes

Gautam Pant, Padmini Srinivasan

Research output: Contribution to journal › Article › peer-review


Topical crawling is a young and creative area of research that holds the promise of benefiting from several sophisticated data mining techniques. The use of classification algorithms to guide topical crawlers has been sporadically suggested in the literature. No systematic study, however, has been done on their relative merits. Using the lessons learned from our previous crawler evaluation studies, we experiment with multiple versions of different classification schemes. The crawling process is modeled as a parallel best-first search over a graph defined by the Web. The classifiers provide heuristics to the crawler, thus biasing it towards certain portions of the Web graph. Our results show that Naive Bayes is a weak choice for guiding a topical crawler when compared with a Support Vector Machine or a Neural Network. Further, the weak performance of Naive Bayes can be partly explained by the extreme skewness of the posterior probabilities it generates. We also observe that, despite similar performance, different topical crawlers cover subspaces of the Web with low overlap.
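As an illustration of the best-first search mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It runs sequentially rather than in parallel, works on a toy in-memory graph instead of the Web, and uses a hypothetical `score` callable as a stand-in for a trained classifier's relevance estimate. It also assumes that links inherit the score of the page they were found on, and that seeds start with the highest priority; both are simplifications chosen for the example.

```python
import heapq
from typing import Callable, Dict, List


def best_first_crawl(
    graph: Dict[str, List[str]],
    seeds: List[str],
    score: Callable[[str], float],
    max_pages: int,
) -> List[str]:
    """Visit up to max_pages pages, always expanding the best-scored frontier URL.

    `graph` maps a URL to its outlinks (a toy stand-in for fetching a page).
    `score` is a hypothetical stand-in for a classifier's relevance estimate.
    """
    # heapq is a min-heap, so priorities are negated to pop the best URL first.
    # Assumption: seeds start with the maximum priority of 1.0.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited: List[str] = []

    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        # Score the page just "fetched"; its unseen outlinks inherit that score.
        page_score = score(url)
        for link in graph.get(url, []):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-page_score, link))
    return visited


# Toy data: the node names and relevance values are made up for illustration.
toy_graph = {
    "seed": ["a", "b"],
    "a": ["a1"],
    "b": ["b1"],
    "a1": [],
    "b1": [],
}
toy_relevance = {"seed": 0.5, "a": 0.9, "b": 0.1, "a1": 0.8, "b1": 0.2}

order = best_first_crawl(toy_graph, ["seed"], lambda u: toy_relevance[u], max_pages=4)
print(order)
```

In this sketch the crawler follows the link found on the high-scoring page "a" before returning to "b", which is the kind of bias toward promising regions of the graph that the abstract describes. Swapping in a different `score` function changes which pages are reached within the same page budget.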

Original language: English (US)
Pages (from-to): 430-462
Number of pages: 33
Journal: ACM Transactions on Information Systems
Issue number: 4
State: Published - 2005
Externally published: Yes


Keywords

  • Classifiers
  • Focused crawlers
  • Machine learning
  • Topical crawlers

ASJC Scopus subject areas

  • Information Systems
  • Business, Management and Accounting(all)
  • Computer Science Applications

