Experiments in high-dimensional text categorization

Fred J. Damerau, Tong Zhang, Sholom M. Weiss, Nitin Indurkhya

Research output: Contribution to journal > Conference article > peer-review

Abstract

We present results for automated text categorization of the Reuters-810000 collection of news stories. Our experiments use the entire one-year collection of 810,000 stories and the entire subject index. We divide the data into monthly groups and provide an initial benchmark of text categorization performance on the complete collection. Experimental results show that efficient sparse-feature implementations of linear methods and decision trees, using a global unstemmed dictionary, can readily handle applications of this size. Predictive performance is approximately as strong as the best results for the much smaller older Reuters collections. Detailed results are provided over time periods. It is shown that a smaller time horizon does not diminish predictive quality, implying reduced demands for retraining when sample size is large.

Original language: English (US)
Pages (from-to): 357-358
Number of pages: 2
Journal: SIGIR Forum (ACM Special Interest Group on Information Retrieval)
State: Published - 2002
Externally published: Yes
Event: Proceedings of the Twenty-Fifth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval - Tampere, Finland
Duration: Aug 11 2002 to Aug 15 2002

ASJC Scopus subject areas

  • Management Information Systems
  • Hardware and Architecture
