TY - GEN
T1 - Mining long sequential patterns in a noisy environment
AU - Yang, Jiong
AU - Wang, Wei
AU - Yu, Philip S.
AU - Han, Jiawei
N1 - Publisher Copyright:
© 2002 ACM.
PY - 2002/6/3
Y1 - 2002/6/3
N2 - Pattern discovery in long sequences is of great importance in many applications including computational biology study, consumer behavior analysis, system performance analysis, etc. In a noisy environment, an observed sequence may not accurately reflect the underlying behavior. For example, in a protein sequence, the amino acid N is likely to mutate to D with little impact to the biological function of the protein. It would be desirable if the occurrence of D in the observation can be related to a possible mutation from N in an appropriate manner. Unfortunately, the support measure (i.e., the number of occurrences) of a pattern does not serve this purpose. In this paper, we introduce the concept of compatibility matrix as the means to provide a probabilistic connection from the observation to the underlying true value. A new metric match is also proposed to capture the "real support"of a pattern which would be expected if a noise-free environment is assumed. In addition, in the context we address, a pattern could be very long. The standard pruning technique developed for the market basket problem may not work efficiently. As a result, a novel algorithm that combines statistical sampling and a new technique (namely border collapsing) is devised to discover long patterns in a minimal number of scans of the sequence database with sufficiently high confidence. Empirical results demonstrate the robustness of the match model (with respect to the noise) and the efficiency of the probabilistic algorithm.
AB - Pattern discovery in long sequences is of great importance in many applications including computational biology study, consumer behavior analysis, system performance analysis, etc. In a noisy environment, an observed sequence may not accurately reflect the underlying behavior. For example, in a protein sequence, the amino acid N is likely to mutate to D with little impact to the biological function of the protein. It would be desirable if the occurrence of D in the observation can be related to a possible mutation from N in an appropriate manner. Unfortunately, the support measure (i.e., the number of occurrences) of a pattern does not serve this purpose. In this paper, we introduce the concept of compatibility matrix as the means to provide a probabilistic connection from the observation to the underlying true value. A new metric match is also proposed to capture the "real support"of a pattern which would be expected if a noise-free environment is assumed. In addition, in the context we address, a pattern could be very long. The standard pruning technique developed for the market basket problem may not work efficiently. As a result, a novel algorithm that combines statistical sampling and a new technique (namely border collapsing) is devised to discover long patterns in a minimal number of scans of the sequence database with sufficiently high confidence. Empirical results demonstrate the robustness of the match model (with respect to the noise) and the efficiency of the probabilistic algorithm.
UR - http://www.scopus.com/inward/record.url?scp=85134356361&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85134356361&partnerID=8YFLogxK
U2 - 10.1145/564691.564738
DO - 10.1145/564691.564738
M3 - Conference contribution
AN - SCOPUS:85134356361
T3 - Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD 2002
SP - 406
EP - 417
BT - Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD 2002
PB - Association for Computing Machinery
T2 - 2002 ACM SIGMOD International Conference on Management of Data, SIGMOD 2002
Y2 - 3 June 2002 through 6 June 2002
ER -