TY - JOUR
T1 - High-performance commercial data mining
T2 - A multistrategy machine learning application
AU - Hsu, William H.
AU - Welge, Michael
AU - Redman, Tom
AU - Clutter, David
N1 - Funding Information:
Support for this research was provided in part by Allstate Insurance Company and by the Office of Naval Research under Grant N00014-00-1-0769. We thank Tilt Thompkins and William M. Pottenger of NCSA for administering the Allstate One Company project and for developing the rule simulations that made the machine learning research reported in this paper possible. In addition, we thank the following NCSA staff members for their contributions to the project: Yuching Ni, NCSA’s consultant to Allstate; and our fellow researchers at the NCSA Automated Learning Group (ALG): Loretta S. Auvil, Colleen Bushell, Lisa Gatzke, and David Tcheng. We thank the following ALG students for their assistance with D2K and Jenesis development: Michael Bach, Russ Bader, Mike Perry, Kristopher Wuollett, Ting-Hao Yang, and Dav Zimak. We thank David E. Goldberg for insightful remarks regarding genetic algorithms for constructive induction and Larry A. Rendell for early discussions on inductive bias optimization. We also thank the director of the Allstate Underwriting Division, Joe Porter, for sharing his expertise and for his weekly participation in experimental discussions and visualization sessions. Finally, we thank the anonymous reviewers for beneficial comments regarding wrappers for performance tuning in KDD, hyperparameter (especially inductive bias parameter) optimization, and distributed computing performance and scalability issues.
PY - 2002
Y1 - 2002
AB - We present an application of inductive concept learning and interactive visualization techniques to a large-scale commercial data mining project. This paper focuses on design and configuration of high-level optimization systems (wrappers) for relevance determination and constructive induction, and on integrating these wrappers with elicited knowledge on attribute relevance and synthesis. In particular, we discuss decision support issues for the application (cost prediction for automobile insurance markets in several states) and report experiments using D2K, a Java-based visual programming system for data mining and information visualization, and several commercial and research tools. We describe exploratory clustering, descriptive statistics, and supervised decision tree learning in this application, focusing on a parallel genetic algorithm (GA) system, Jenesis, which is used to implement relevance determination (attribute subset selection). Deployed on several high-performance network-of-workstation systems (Beowulf clusters), Jenesis achieves a linear speedup, due to a high degree of task parallelism. Its test set accuracy is significantly higher than that of decision tree inducers alone and is comparable to that of the best extant search-space based wrappers.
KW - Constructive induction
KW - Genetic algorithms
KW - Real-world decision support applications
KW - Relevance determination
KW - Scalable high-performance computing
KW - Software development environments for knowledge discovery in databases (KDD)
UR - http://www.scopus.com/inward/record.url?scp=0141799996&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0141799996&partnerID=8YFLogxK
U2 - 10.1023/A:1016352221465
DO - 10.1023/A:1016352221465
M3 - Review article
AN - SCOPUS:0141799996
SN - 1384-5810
VL - 6
SP - 361
EP - 391
JO - Data Mining and Knowledge Discovery
JF - Data Mining and Knowledge Discovery
IS - 4
ER -