TY - GEN
T1 - Direct mining of discriminative and essential frequent patterns via model-based search tree
AU - Fan, Wei
AU - Zhang, Kun
AU - Cheng, Hong
AU - Gao, Jing
AU - Yan, Xifeng
AU - Han, Jiawei
AU - Yu, Philip
AU - Verscheure, Olivier
PY - 2008
Y1 - 2008
AB - Frequent patterns provide solutions to datasets that do not have well-structured feature vectors. However, frequent pattern mining is non-trivial, since the number of unique patterns is exponential but many are non-discriminative and correlated. Currently, frequent pattern mining is performed in two sequential steps: enumerating a set of frequent patterns, followed by feature selection. Although many methods have been proposed in the past few years for performing each separate step efficiently, there is still limited success in eventually finding highly compact and discriminative patterns. The culprit is the inherent nature of this widely adopted two-step approach. This paper discusses these problems and proposes a new and different method. It builds a decision tree that partitions the data onto different nodes. Then, at each node, it directly discovers a discriminative pattern to further divide its examples into purer subsets. Since the number of examples towards the leaf level is relatively small, the new approach is able to examine patterns with extremely low global support that could not be enumerated on the whole dataset by the two-step method. The discovered feature vectors are more accurate than those of most recently proposed algorithms on some of the most difficult graph as well as frequent itemset problems, while their total size is typically 50% or more smaller. Importantly, the minimum support of some discriminative patterns can be extremely low (e.g., 0.03%). In order to enumerate these low-support patterns, state-of-the-art frequent pattern algorithms either cannot finish due to huge memory consumption or have to enumerate 10^1 to 10^3 times more patterns before they can even be found. Software and datasets are available by contacting the author.
KW - Algorithms
UR - http://www.scopus.com/inward/record.url?scp=65449172082&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=65449172082&partnerID=8YFLogxK
U2 - 10.1145/1401890.1401922
DO - 10.1145/1401890.1401922
M3 - Conference contribution
AN - SCOPUS:65449172082
SN - 9781605581934
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 230
EP - 238
BT - KDD 2008 - Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
T2 - 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2008
Y2 - 24 August 2008 through 27 August 2008
ER -