Repairing fractures between data using genetic programming-based feature extraction: A case study in cancer diagnosis

Jose G. Moreno-Torres, Xavier Llorà, David E. Goldberg, Rohit Bhargava

Research output: Contribution to journal › Article

Abstract

Most model-building processes rest on an underlying assumption: a learned classifier should be usable to explain unseen data from the same problem. Although seemingly reasonable, this assumption tends to fail when dealing with biological data: classifiers built from data generated with the same protocols in two different laboratories can turn out to be different and non-interchangeable. The process of generating data in the lab usually involves too many uncontrollable variables, along with biological variation, and small differences can lead to very different data distributions, with a fracture between data. This paper presents a genetics-based machine learning approach that performs feature extraction on data from one laboratory to help increase the classification performance of an existing classifier built using data from a different laboratory that uses the same protocols, while learning about the shape of the fractures between data that caused the poor performance. The experimental analysis on benchmark problems, together with a real-world problem in prostate cancer diagnosis, shows the good behavior of the proposed algorithm.

Original language: English (US)
Pages (from-to): 805-823
Number of pages: 19
Journal: Information Sciences
Volume: 222
DOI: 10.1016/j.ins.2010.09.018
State: Published - Feb 10 2013

Keywords

  • Biological data
  • Cancer diagnosis
  • Different laboratories
  • Feature extraction
  • Fractures between data
  • Genetic programming

ASJC Scopus subject areas

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Theoretical Computer Science
  • Computer Science Applications
  • Information Systems and Management

Cite this

Repairing fractures between data using genetic programming-based feature extraction: A case study in cancer diagnosis. / Moreno-Torres, Jose G.; Llorà, Xavier; Goldberg, David E.; Bhargava, Rohit.

In: Information Sciences, Vol. 222, 10.02.2013, p. 805-823.

@article{ecb664eafa4949e5a05f79ae110ca967,
title = "Repairing fractures between data using genetic programming-based feature extraction: A case study in cancer diagnosis",
abstract = "Most model-building processes rest on an underlying assumption: a learned classifier should be usable to explain unseen data from the same problem. Although seemingly reasonable, this assumption tends to fail when dealing with biological data: classifiers built from data generated with the same protocols in two different laboratories can turn out to be different and non-interchangeable. The process of generating data in the lab usually involves too many uncontrollable variables, along with biological variation, and small differences can lead to very different data distributions, with a fracture between data. This paper presents a genetics-based machine learning approach that performs feature extraction on data from one laboratory to help increase the classification performance of an existing classifier built using data from a different laboratory that uses the same protocols, while learning about the shape of the fractures between data that caused the poor performance. The experimental analysis on benchmark problems, together with a real-world problem in prostate cancer diagnosis, shows the good behavior of the proposed algorithm.",
keywords = "Biological data, Cancer diagnosis, Different laboratories, Feature extraction, Fractures between data, Genetic programming",
author = "Moreno-Torres, Jose G. and Llor{\`a}, Xavier and Goldberg, David E. and Bhargava, Rohit",
year = "2013",
month = "2",
day = "10",
doi = "10.1016/j.ins.2010.09.018",
language = "English (US)",
volume = "222",
pages = "805--823",
journal = "Information Sciences",
issn = "0020-0255",
publisher = "Elsevier Inc."
}

TY  - JOUR
T1  - Repairing fractures between data using genetic programming-based feature extraction
T2  - A case study in cancer diagnosis
AU  - Moreno-Torres, Jose G.
AU  - Llorà, Xavier
AU  - Goldberg, David E.
AU  - Bhargava, Rohit
PY  - 2013/2/10
Y1  - 2013/2/10
N2  - Most model-building processes rest on an underlying assumption: a learned classifier should be usable to explain unseen data from the same problem. Although seemingly reasonable, this assumption tends to fail when dealing with biological data: classifiers built from data generated with the same protocols in two different laboratories can turn out to be different and non-interchangeable. The process of generating data in the lab usually involves too many uncontrollable variables, along with biological variation, and small differences can lead to very different data distributions, with a fracture between data. This paper presents a genetics-based machine learning approach that performs feature extraction on data from one laboratory to help increase the classification performance of an existing classifier built using data from a different laboratory that uses the same protocols, while learning about the shape of the fractures between data that caused the poor performance. The experimental analysis on benchmark problems, together with a real-world problem in prostate cancer diagnosis, shows the good behavior of the proposed algorithm.
AB  - Most model-building processes rest on an underlying assumption: a learned classifier should be usable to explain unseen data from the same problem. Although seemingly reasonable, this assumption tends to fail when dealing with biological data: classifiers built from data generated with the same protocols in two different laboratories can turn out to be different and non-interchangeable. The process of generating data in the lab usually involves too many uncontrollable variables, along with biological variation, and small differences can lead to very different data distributions, with a fracture between data. This paper presents a genetics-based machine learning approach that performs feature extraction on data from one laboratory to help increase the classification performance of an existing classifier built using data from a different laboratory that uses the same protocols, while learning about the shape of the fractures between data that caused the poor performance. The experimental analysis on benchmark problems, together with a real-world problem in prostate cancer diagnosis, shows the good behavior of the proposed algorithm.
KW  - Biological data
KW  - Cancer diagnosis
KW  - Different laboratories
KW  - Feature extraction
KW  - Fractures between data
KW  - Genetic programming
UR  - http://www.scopus.com/inward/record.url?scp=84870054779&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=84870054779&partnerID=8YFLogxK
U2  - 10.1016/j.ins.2010.09.018
DO  - 10.1016/j.ins.2010.09.018
M3  - Article
AN  - SCOPUS:84870054779
VL  - 222
SP  - 805
EP  - 823
JO  - Information Sciences
JF  - Information Sciences
SN  - 0020-0255
ER  -