Better rules, fewer features: A semantic approach to selecting features from text

Catherine Blake, Wanda Pratt

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The choice of features used to represent a domain has a profound effect on the quality of the model produced; yet few researchers have investigated the relationship between the features used to represent text and the quality of the final model. We explored this relationship for medical texts by comparing association rules based on features with three different semantic levels: (1) words, (2) manually assigned keywords, and (3) automatically selected medical concepts. Our preliminary findings indicate that bi-directional association rules based on concepts or keywords are more plausible and more useful than those based on word features. The concept and keyword representations also required 90% fewer features than the word representation. This drastic dimensionality reduction suggests that this approach is well suited to large corpora of medical text, such as parts of the Web.
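To make the idea of a bi-directional association rule concrete, here is a minimal sketch in Python. It is not the authors' implementation. It assumes that "bi-directional" means a pair of features (a, b) for which both a -> b and b -> a meet a confidence threshold, and it assumes each document has already been reduced to a set of features (words, keywords, or concepts). The function name, the thresholds, and the toy documents are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def bidirectional_rules(docs, min_support=0.5, min_confidence=0.8):
    """Return feature pairs whose rules hold in both directions.

    docs: iterable of feature collections, one per document.
    Assumption (not from the paper): a pair (a, b) qualifies when its
    support and the confidence of both a -> b and b -> a meet the thresholds.
    """
    doc_sets = [set(features) for features in docs]  # ignore repeated features
    n = len(doc_sets)
    item_counts = Counter()
    pair_counts = Counter()
    for features in doc_sets:
        item_counts.update(features)
        # Sort so each unordered pair is always counted under the same key.
        pair_counts.update(combinations(sorted(features), 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n               # fraction of documents with both
        conf_ab = count / item_counts[a]  # confidence of a -> b
        conf_ba = count / item_counts[b]  # confidence of b -> a
        if support >= min_support and min(conf_ab, conf_ba) >= min_confidence:
            rules.append((a, b, support, conf_ab, conf_ba))
    return sorted(rules)


# Toy, made-up concept sets for illustration only; not data from the paper.
docs = [
    {"breast cancer", "tamoxifen"},
    {"breast cancer", "tamoxifen", "estrogen"},
    {"breast cancer", "tamoxifen"},
    {"estrogen"},
]
rules = bidirectional_rules(docs)
print(rules)
```

The same function can be run on word-level, keyword-level, or concept-level feature sets for the same documents, which is one way to compare how many features and rules each representation yields.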

Original language: English (US)
Title of host publication: Proceedings - 2001 IEEE International Conference on Data Mining, ICDM'01
Pages: 59-66
Number of pages: 8
State: Published - Dec 1 2001
Externally published: Yes
Event: 1st IEEE International Conference on Data Mining, ICDM'01 - San Jose, CA, United States
Duration: Nov 29 2001 to Dec 2 2001

Publication series

Name: Proceedings - IEEE International Conference on Data Mining, ICDM
ISSN (Print): 1550-4786

ASJC Scopus subject areas

  • Engineering (all)
