TY - JOUR
T1 - Efficient classification across multiple database relations
T2 - A CrossMine approach
AU - Yin, Xiaoxin
AU - Han, Jiawei
AU - Yang, J.
AU - Yu, P. S.
N1 - Funding Information:
The work was supported in part by the US National Science Foundation under grants IIS-02-09199 and IIS-03-08215, the University of Illinois, and an IBM Faculty Award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies. This paper is based on X. Yin, J. Han, J. Yang, and P.S. Yu, “CrossMine: Efficient Classification across Multiple Database Relations,” Proceedings of the 2004 International Conference on Data Engineering (ICDE ’04), pp. 399-410, Mar. 2004. In comparison with the conference paper, this manuscript contains major new technical value. For details, please see the submission note at the end of the paper.
PY - 2006/6/1
Y1 - 2006/6/1
N2 - Relational databases are the most popular repository for structured data and are thus one of the richest sources of knowledge in the world. In a relational database, multiple relations are linked together via entity-relationship links. Multirelational classification is the procedure of building a classifier based on information stored in multiple relations and making predictions with it. Existing approaches of Inductive Logic Programming (recently, also known as Relational Mining) have proven effective with high accuracy in multirelational classification. Unfortunately, most of them suffer from scalability problems with regard to the number of relations in databases. In this paper, we propose a new approach, called CrossMine, which includes a set of novel and powerful methods for multirelational classification, including 1) tuple ID propagation, an efficient and flexible method for virtually joining relations, which enables convenient search among different relations, 2) new definitions for predicates and decision-tree nodes, which involve aggregated information to provide essential statistics for classification, and 3) a selective sampling method for improving scalability with regard to the number of tuples. Based on these techniques, we propose two scalable and accurate methods for multirelational classification: CrossMine-Rule, a rule-based method, and CrossMine-Tree, a decision-tree-based method. Our comprehensive experiments on both real and synthetic data sets demonstrate the high scalability and accuracy of the CrossMine approach.
AB - Relational databases are the most popular repository for structured data and are thus one of the richest sources of knowledge in the world. In a relational database, multiple relations are linked together via entity-relationship links. Multirelational classification is the procedure of building a classifier based on information stored in multiple relations and making predictions with it. Existing approaches of Inductive Logic Programming (recently, also known as Relational Mining) have proven effective with high accuracy in multirelational classification. Unfortunately, most of them suffer from scalability problems with regard to the number of relations in databases. In this paper, we propose a new approach, called CrossMine, which includes a set of novel and powerful methods for multirelational classification, including 1) tuple ID propagation, an efficient and flexible method for virtually joining relations, which enables convenient search among different relations, 2) new definitions for predicates and decision-tree nodes, which involve aggregated information to provide essential statistics for classification, and 3) a selective sampling method for improving scalability with regard to the number of tuples. Based on these techniques, we propose two scalable and accurate methods for multirelational classification: CrossMine-Rule, a rule-based method, and CrossMine-Tree, a decision-tree-based method. Our comprehensive experiments on both real and synthetic data sets demonstrate the high scalability and accuracy of the CrossMine approach.
KW - Classification
KW - Data mining
KW - Relational databases
UR - http://www.scopus.com/inward/record.url?scp=33646417316&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=33646417316&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2006.94
DO - 10.1109/TKDE.2006.94
M3 - Article
AN - SCOPUS:33646417316
SN - 1041-4347
VL - 18
SP - 770
EP - 783
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 6
ER -