CrossMine: Efficient classification across multiple database relations

Xiaoxin Yin, Jiawei Han, Jiong Yang, Philip S. Yu

Research output: Contribution to conference (Paper)

Abstract

Most of today's structured data is stored in relational databases. Such a database consists of multiple relations that are linked together conceptually via entity-relationship links in the design of the relational database schema. Multi-relational classification can be widely used in many disciplines, such as financial decision making, medical research, and geographical applications. However, most classification approaches only work on single "flat" data relations. It is usually difficult to convert multiple relations into a single flat relation without either introducing a huge, undesirable "universal relation" or losing essential information. Previous work using Inductive Logic Programming approaches (recently also known as Relational Mining) has proven effective, with high accuracy in multi-relational classification. Unfortunately, these approaches suffer from poor scalability with respect to the number of relations and the number of attributes in databases. In this paper we propose CrossMine, an efficient and scalable approach for multi-relational classification. Several novel methods are developed in CrossMine, including (1) tuple ID propagation, which performs a semantics-preserving virtual join to achieve high efficiency on databases with complex schemas, and (2) a selective sampling method, which makes it highly scalable with respect to the number of tuples in the databases. Both the theoretical background and the implementation techniques of CrossMine are presented. Our comprehensive experiments on both real and synthetic databases demonstrate the high scalability and accuracy of CrossMine.
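
The abstract names tuple ID propagation only briefly, so the following is a minimal sketch of the general idea rather than the authors' implementation. It assumes a toy schema invented for illustration (a target relation Loan and a non-target relation Account); the helper names propagate_ids and class_counts are likewise hypothetical. Instead of materializing the join, each non-target tuple is annotated with the IDs of the target tuples it joins with, so a candidate predicate on the non-target relation can be scored by counting the class labels of the covered target tuples.

```python
from collections import defaultdict

# Hypothetical toy data (not taken from the paper):
# target relation Loan(loan_id, account_id, label)
loans = [
    (1, 124, "+"),
    (2, 124, "+"),
    (3, 108, "-"),
    (4, 45, "-"),
    (5, 45, "+"),
]
# non-target relation Account(account_id, frequency)
accounts = [
    (124, "monthly"),
    (108, "weekly"),
    (45, "monthly"),
    (67, "weekly"),
]


def propagate_ids(target, join_col):
    """Map each join-key value to the set of target tuple IDs that carry it."""
    ids_by_key = defaultdict(set)
    for row in target:
        ids_by_key[row[join_col]].add(row[0])
    return ids_by_key


def class_counts(ids, labels):
    """Count positive and negative target tuples among the given IDs."""
    pos = sum(1 for i in ids if labels[i] == "+")
    return pos, len(ids) - pos


labels = {loan_id: label for loan_id, _, label in loans}
ids_by_account = propagate_ids(loans, join_col=1)

# Annotate every Account tuple with the IDs of the Loan tuples it joins with.
# This stands in for the join without building the joined relation.
account_ids = {acc_id: ids_by_account.get(acc_id, set()) for acc_id, _ in accounts}

# Score a candidate predicate on the non-target relation,
# here Account.frequency == "monthly", using only the propagated IDs.
covered = set()
for acc_id, freq in accounts:
    if freq == "monthly":
        covered |= account_ids[acc_id]

pos, neg = class_counts(covered, labels)
print(sorted(covered), pos, neg)
```

The counts of covered positive and negative target tuples are the inputs a rule learner would need to rate the predicate; which scoring function to apply to them is left open in this sketch.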

Original language: English (US)
Pages: 399-410
Number of pages: 12
State: Published - Jun 1 2004
Event: Proceedings - 20th International Conference on Data Engineering - ICDE 2004 - Boston, MA., United States
Duration: Mar 30 2004 - Apr 2 2004

Other

Other: Proceedings - 20th International Conference on Data Engineering - ICDE 2004
Country: United States
City: Boston, MA.
Period: 3/30/04 - 4/2/04

Fingerprint

Scalability
Inductive logic programming (ILP)
Decision making
Semantics
Sampling
Experiments

ASJC Scopus subject areas

  • Software
  • Signal Processing
  • Information Systems

Cite this

Yin, X., Han, J., Yang, J., & Yu, P. S. (2004). CrossMine: Efficient classification across multiple database relations. 399-410. Paper presented at Proceedings - 20th International Conference on Data Engineering - ICDE 2004, Boston, MA., United States.

CrossMine: Efficient classification across multiple database relations. / Yin, Xiaoxin; Han, Jiawei; Yang, Jiong; Yu, Philip S.

2004. 399-410 Paper presented at Proceedings - 20th International Conference on Data Engineering - ICDE 2004, Boston, MA., United States.

Yin, X, Han, J, Yang, J & Yu, PS 2004, 'CrossMine: Efficient classification across multiple database relations', Paper presented at Proceedings - 20th International Conference on Data Engineering - ICDE 2004, Boston, MA., United States, 3/30/04 - 4/2/04 pp. 399-410.
Yin X, Han J, Yang J, Yu PS. CrossMine: Efficient classification across multiple database relations. 2004. Paper presented at Proceedings - 20th International Conference on Data Engineering - ICDE 2004, Boston, MA., United States.
Yin, Xiaoxin; Han, Jiawei; Yang, Jiong; Yu, Philip S. / CrossMine: Efficient classification across multiple database relations. Paper presented at Proceedings - 20th International Conference on Data Engineering - ICDE 2004, Boston, MA., United States. 12 p.
@conference{4240cba7d95644ed909e1a98f75d33f5,
title = "CrossMine: Efficient classification across multiple database relations",
abstract = "Most of today's structured data is stored in relational databases. Such a database consists of multiple relations which are linked together conceptually via entity-relationship links in the design of relational database schemas. Multi-relational classification can be widely used in many disciplines, such as financial decision making, medical research, and geographical applications. However, most classification approaches only work on single {"}flat{"} data relations. It is usually difficult to convert multiple relations into a single flat relation without either introducing huge, undesirable {"}universal relation{"} or losing essential information. Previous works using Inductive Logic Programming approaches (recently also known as Relational Mining) have proven effective with high accuracy in multi-relational classification. Unfortunately, they suffer from poor scalability w.r.t. the number of relations and the number of attributes in databases. In this paper we propose CrossMine, an efficient and scalable approach for multi-relational classification. Several novel methods are developed in CrossMine, including (1) tuple ID propagation, which performs semantics-preserving virtual join to achieve high efficiency on databases with complex schemas, and (2) a selective sampling method, which makes it highly scalable w.r.t. the number of tuples in the databases. Both theoretical backgrounds and implementation techniques of CrossMine are introduced. Our comprehensive experiments on both real and synthetic databases demonstrate the high scalability and accuracy of CrossMine.",
author = "Xiaoxin Yin and Jiawei Han and Jiong Yang and Yu, {Philip S.}",
year = "2004",
month = "6",
day = "1",
language = "English (US)",
pages = "399--410",
note = "Proceedings - 20th International Conference on Data Engineering - ICDE 2004 ; Conference date: 30-03-2004 Through 02-04-2004",

}

TY - CONF

T1 - CrossMine: Efficient classification across multiple database relations

AU - Yin, Xiaoxin

AU - Han, Jiawei

AU - Yang, Jiong

AU - Yu, Philip S.

PY - 2004/6/1

Y1 - 2004/6/1

N2 - Most of today's structured data is stored in relational databases. Such a database consists of multiple relations which are linked together conceptually via entity-relationship links in the design of relational database schemas. Multi-relational classification can be widely used in many disciplines, such as financial decision making, medical research, and geographical applications. However, most classification approaches only work on single "flat" data relations. It is usually difficult to convert multiple relations into a single flat relation without either introducing huge, undesirable "universal relation" or losing essential information. Previous works using Inductive Logic Programming approaches (recently also known as Relational Mining) have proven effective with high accuracy in multi-relational classification. Unfortunately, they suffer from poor scalability w.r.t. the number of relations and the number of attributes in databases. In this paper we propose CrossMine, an efficient and scalable approach for multi-relational classification. Several novel methods are developed in CrossMine, including (1) tuple ID propagation, which performs semantics-preserving virtual join to achieve high efficiency on databases with complex schemas, and (2) a selective sampling method, which makes it highly scalable w.r.t. the number of tuples in the databases. Both theoretical backgrounds and implementation techniques of CrossMine are introduced. Our comprehensive experiments on both real and synthetic databases demonstrate the high scalability and accuracy of CrossMine.

UR - http://www.scopus.com/inward/record.url?scp=2442458705&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=2442458705&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:2442458705

SP - 399

EP - 410

ER -