TY - JOUR
T1 - From sequential pattern mining to structured pattern mining
T2 - A pattern-growth approach
AU - Han, Jia Wei
AU - Pei, Jian
AU - Yan, Xi Feng
N1 - Funding Information:
A sequence database S is a set of tuples (sid, s), where sid is a sequenceAd and s a sequence. A tuple lsid, s} is said to contain a sequence a, if a is a subsequence of s. The support of a sequence a in a sequence database S is the number of tuples in the database containing a, i.e., supports(a ) = I{(sid, s}l((sid, s) e S) A (a ~_ s)}\[. It can be denoted by support(a) if the sequence database is clear from the context. Given a positive * Survey The work was supported in part by the Natural Sciences and Engineering Research Council of Canada, the Networks of Centres of Excellence of Canada, the Hewlett-Packard Lab, the U.S. National Science Foundation (Grant Nos. NSF IIS-02-09199, NSF IIS-03-08001), and the University of Illinois. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.
PY - 2004/5
Y1 - 2004/5
N2 - Sequential pattern mining is an important data mining problem with broad applications. However, it is also a challenging problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Recent studies have developed two major classes of sequential pattern mining methods: (1) a candidate generation-and-test approach, represented by (i) GSP, a horizontal format-based sequential pattern mining method, and (ii) SPADE, a vertical format-based method; and (2) a pattern-growth method, represented by PrefixSpan and its further extensions, such as gSpan for mining structured patterns. In this study, we perform a systematic introduction and presentation of the pattern-growth methodology and study its principles and extensions. We first introduce two interesting pattern-growth algorithms, FreeSpan and PrefixSpan, for efficient sequential pattern mining. Then we introduce gSpan for mining structured patterns using the same methodology. Their relative performance in large databases is presented and analyzed. Several extensions of these methods are also discussed in the paper, including mining multi-level, multi-dimensional patterns and mining constraint-based patterns.
AB - Sequential pattern mining is an important data mining problem with broad applications. However, it is also a challenging problem since the mining may have to generate or examine a combinatorially explosive number of intermediate subsequences. Recent studies have developed two major classes of sequential pattern mining methods: (1) a candidate generation-and-test approach, represented by (i) GSP, a horizontal format-based sequential pattern mining method, and (ii) SPADE, a vertical format-based method; and (2) a pattern-growth method, represented by PrefixSpan and its further extensions, such as gSpan for mining structured patterns. In this study, we perform a systematic introduction and presentation of the pattern-growth methodology and study its principles and extensions. We first introduce two interesting pattern-growth algorithms, FreeSpan and PrefixSpan, for efficient sequential pattern mining. Then we introduce gSpan for mining structured patterns using the same methodology. Their relative performance in large databases is presented and analyzed. Several extensions of these methods are also discussed in the paper, including mining multi-level, multi-dimensional patterns and mining constraint-based patterns.
KW - Data mining
KW - Performance analysis
KW - Scalability
KW - Sequential pattern mining
KW - Structured pattern mining
UR - http://www.scopus.com/inward/record.url?scp=2442650185&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=2442650185&partnerID=8YFLogxK
U2 - 10.1007/BF02944897
DO - 10.1007/BF02944897
M3 - Article
AN - SCOPUS:2442650185
SN - 1000-9000
VL - 19
SP - 257
EP - 279
JO - Journal of Computer Science and Technology
JF - Journal of Computer Science and Technology
IS - 3
ER -