TY - JOUR
T1 - Repairing fractures between data using genetic programming-based feature extraction
T2 - A case study in cancer diagnosis
AU - Moreno-Torres, Jose G.
AU - Llorà, Xavier
AU - Goldberg, David E.
AU - Bhargava, Rohit
N1 - Funding Information:
Jose García Moreno-Torres was supported by a scholarship from ‘Obra Social la Caixa’ and is currently supported by an FPU grant from the Ministerio de Educación y Ciencia of the Spanish Government, and also by the KEEL project (TIN2008-06681-C06-01). Rohit Bhargava would like to acknowledge collaborators over the years, especially Dr. Stephen M. Hewitt and Dr. Ira W. Levin of the National Institutes of Health, for numerous useful discussions and guidance. Funding for this work was provided in part by the University of Illinois Research Board and by the Department of Defense Prostate Cancer Research Program. This work was also funded in part by the National Center for Supercomputing Applications and the University of Illinois, under the auspices of the NCSA/UIUC faculty fellows program.
PY - 2013/2/10
Y1 - 2013/2/10
N2 - Most model-building processes rest on an underlying assumption: a learned classifier should be usable to explain unseen data from the same problem. Although seemingly reasonable, this assumption tends to fail for biological data: data generated using the same protocols in two different laboratories can lead to two different, non-interchangeable classifiers. There are usually too many uncontrollable variables in the process of generating data in the lab, in addition to biological variations, and small differences can lead to very different data distributions, with a fracture between the data. This paper presents a genetics-based machine learning approach that performs feature extraction on data from one laboratory to help increase the classification performance of an existing classifier that was built using data from a different laboratory following the same protocols, while learning about the shape of the fractures between data that caused the poor performance. An experimental analysis over benchmark problems, together with a real-world problem on prostate cancer diagnosis, shows the good behavior of the proposed algorithm.
AB - Most model-building processes rest on an underlying assumption: a learned classifier should be usable to explain unseen data from the same problem. Although seemingly reasonable, this assumption tends to fail for biological data: data generated using the same protocols in two different laboratories can lead to two different, non-interchangeable classifiers. There are usually too many uncontrollable variables in the process of generating data in the lab, in addition to biological variations, and small differences can lead to very different data distributions, with a fracture between the data. This paper presents a genetics-based machine learning approach that performs feature extraction on data from one laboratory to help increase the classification performance of an existing classifier that was built using data from a different laboratory following the same protocols, while learning about the shape of the fractures between data that caused the poor performance. An experimental analysis over benchmark problems, together with a real-world problem on prostate cancer diagnosis, shows the good behavior of the proposed algorithm.
KW - Biological data
KW - Cancer diagnosis
KW - Different laboratories
KW - Feature extraction
KW - Fractures between data
KW - Genetic programming
UR - http://www.scopus.com/inward/record.url?scp=84870054779&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84870054779&partnerID=8YFLogxK
U2 - 10.1016/j.ins.2010.09.018
DO - 10.1016/j.ins.2010.09.018
M3 - Article
AN - SCOPUS:84870054779
SN - 0020-0255
VL - 222
SP - 805
EP - 823
JO - Information Sciences
JF - Information Sciences
ER -