TY - JOUR
T1 - Mining sequential patterns by pattern-growth
T2 - The prefixspan approach
AU - Pei, Jian
AU - Han, Jiawei
AU - Mortazavi-Asl, Behzad
AU - Wang, Jianyong
AU - Pinto, Helen
AU - Chen, Qiming
AU - Dayal, Umeshwar
AU - Hsu, Mei Chun
N1 - Funding Information:
The authors are grateful to Dr. Mohammed Zaki for providing them with the source code of SPADE and the vertical data conversion package, as well as promptly answering many of their questions related to SPADE. Also, they would like to express their thanks to Blue Martini Software Inc. for sending them the Gazelle data set and allowing them to publish their algorithm performance results using this data set. The work was supported in part by the Natural Sciences and Engineering Research Council of Canada, the Networks of Centres of Excellence of Canada, the Hewlett-Packard Lab, the US National Science Foundation NSF IIS-02-09199, NSF IIS-03-08001, and the University of Illinois. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding agencies. This paper is a major-value added version of the following conference paper [19].
PY - 2004/11
Y1 - 2004/11
N2 - Sequential pattern mining is an important data mining problem with broad applications. However, it Is also a difficult problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Most of the previously developed sequential pattern mining methods, such as GSP, explore a candidate generation-and-test approach [1] to reduce the number of candidates to be examined. However, this approach may not be efficient in mining large sequence databases having numerous patterns and/or long patterns. In this paper, we propose a projection-based, sequential pattern-growth approach for efficient mining of sequential patterns. In this approach, a sequence database is recursively projected into a set of smaller projected databases, and sequential patterns are grown in each projected database by exploring only locally frequent fragments. Based on an initial study of the pattern growth-based sequential pattern mining, FreeSpan [8], we propose a more efficient method, called PSP, which offers ordered growth and reduced projected databases. To further improve the performance, a pseudoprojection technique is developed in PrefixSpan. A comprehensive performance study shows that PrefixSpan, in most cases, outperforms the a priori-based algorithm GSP, FreeSpan, and SPADE [29] (a sequential pattern mining algorithm that adopts vertical data format), and PrefixSpan integrated with pseudoprojection is the fastest among all the tested algorithms. Furthermore, this mining methodology can be extended to mining sequential patterns with user-specified constraints. The high promise of the pattern-growth approach may lead to its further extension toward efficient mining of other kinds of frequent patterns, such as frequent substructures.
AB - Sequential pattern mining is an important data mining problem with broad applications. However, it Is also a difficult problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Most of the previously developed sequential pattern mining methods, such as GSP, explore a candidate generation-and-test approach [1] to reduce the number of candidates to be examined. However, this approach may not be efficient in mining large sequence databases having numerous patterns and/or long patterns. In this paper, we propose a projection-based, sequential pattern-growth approach for efficient mining of sequential patterns. In this approach, a sequence database is recursively projected into a set of smaller projected databases, and sequential patterns are grown in each projected database by exploring only locally frequent fragments. Based on an initial study of the pattern growth-based sequential pattern mining, FreeSpan [8], we propose a more efficient method, called PSP, which offers ordered growth and reduced projected databases. To further improve the performance, a pseudoprojection technique is developed in PrefixSpan. A comprehensive performance study shows that PrefixSpan, in most cases, outperforms the a priori-based algorithm GSP, FreeSpan, and SPADE [29] (a sequential pattern mining algorithm that adopts vertical data format), and PrefixSpan integrated with pseudoprojection is the fastest among all the tested algorithms. Furthermore, this mining methodology can be extended to mining sequential patterns with user-specified constraints. The high promise of the pattern-growth approach may lead to its further extension toward efficient mining of other kinds of frequent patterns, such as frequent substructures.
KW - Data mining algorithm
KW - Frequent pattern
KW - Performance analysis
KW - Scalability
KW - Sequence database
KW - Sequential pattern
KW - Transaction database
UR - http://www.scopus.com/inward/record.url?scp=13844256245&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=13844256245&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2004.77
DO - 10.1109/TKDE.2004.77
M3 - Article
AN - SCOPUS:13844256245
SN - 1041-4347
VL - 16
SP - 1424
EP - 1440
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 11
ER -