MACFP: Maximal approximate consecutive frequent pattern mining under edit distance

Jingbo Shang, Jian Peng, Jiawei Han

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Consecutive pattern mining, which aims to find sequential patterns that are substrings, is a special case of frequent pattern mining and has played a crucial role in many real-world applications, especially biological sequence analysis, time series analysis, and network log mining. Approximations between strings, including insertions, deletions, and substitutions, are widely used in biological sequence comparisons. However, most existing string pattern mining methods consider only Hamming distance, which does not allow insertions or deletions (indels). Little attention has been paid to general approximate consecutive frequent pattern mining under edit distance, potentially because of its high computational complexity, particularly on DNA sequences with billions of base pairs. In this paper, we introduce an efficient solution to this problem. We first formulate the Maximal Approximate Consecutive Frequent Pattern Mining (MACFP) problem, which identifies substring patterns under edit distance in a long query sequence. Then, we propose a novel algorithm with linear time complexity to check whether the support of a substring pattern is above a predefined threshold in the query sequence, thus greatly reducing the computational complexity of MACFP. With this fast decision algorithm, we can efficiently solve the original pattern discovery problem using several indexing and searching techniques. Comprehensive experiments on sequence pattern analysis and a study on a cancer genomics application demonstrate the effectiveness and efficiency of our algorithm compared with several existing methods.
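To make the support check concrete, here is a minimal Python sketch of a baseline decision procedure. It is not the linear-time algorithm proposed in the paper. It uses the classical dynamic program for approximate substring matching (Sellers' algorithm), which takes time proportional to the pattern length times the sequence length. The function names, and the choice to count support as the number of end positions in the query sequence where some substring lies within edit distance k of the pattern, are assumptions made for illustration; the paper's exact definition of support may differ.

```python
def approx_end_positions(pattern, text, k):
    """Return 1-based end positions j in `text` such that some substring
    ending at j is within edit distance k of `pattern`.

    Baseline dynamic program (Sellers' algorithm), O(len(pattern) * len(text)).
    Illustrative only; not the paper's linear-time method.
    """
    m = len(pattern)
    # prev[i]: edit distance between pattern[:i] and the empty string.
    prev = list(range(m + 1))
    ends = []
    for j, c in enumerate(text, start=1):
        # cur[0] = 0 because a match may start at any position in the text.
        cur = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == c else 1
            cur[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra character in the text
                cur[i - 1] + 1,      # pattern character missing from the text
            )
        if cur[m] <= k:
            ends.append(j)
        prev = cur
    return ends


def support_at_least(pattern, text, k, min_support):
    """Decide whether the (assumed) support of `pattern` reaches `min_support`."""
    return len(approx_end_positions(pattern, text, k)) >= min_support
```

For example, `support_at_least("ACGT", sequence, 1, 3)` would ask whether at least three end positions in a hypothetical string `sequence` admit a match to "ACGT" with at most one insertion, deletion, or substitution. Running this check for every candidate substring is what becomes prohibitive on very long sequences, which is the cost the abstract's faster decision algorithm is meant to reduce.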

Original language: English (US)
Title of host publication: 16th SIAM International Conference on Data Mining 2016, SDM 2016
Editors: Sanjay Chawla Venkatasubramanian, Wagner Meira
Publisher: Society for Industrial and Applied Mathematics Publications
Pages: 558-566
Number of pages: 9
ISBN (Electronic): 9781510828117
State: Published - 2016
Event: 16th SIAM International Conference on Data Mining 2016, SDM 2016 - Miami, United States
Duration: May 5 2016 to May 7 2016

Publication series

Name: 16th SIAM International Conference on Data Mining 2016, SDM 2016

Other

Other: 16th SIAM International Conference on Data Mining 2016, SDM 2016
Country/Territory: United States
City: Miami
Period: 5/5/16 to 5/7/16

ASJC Scopus subject areas

  • Computer Science Applications
  • Software
