Mining approximate patterns with frequent locally optimal occurrences

Cited by: 3
Authors
Nakamura, Atsuyoshi [1]
Takigawa, Ichigaku [1]
Tosaka, Hisashi [2]
Kudo, Mineichi [1]
Mamitsuka, Hiroshi [3]
Affiliations
[1] Hokkaido Univ, Kita Ku, Kita 14,Nishi 9, Sapporo, Hokkaido 0600814, Japan
[2] NS Solut Corp, Tokyo, Japan
[3] Kyoto Univ, Inst Chem Res, Uji, Kyoto 6110011, Japan
Keywords
Alignment; Frequent pattern mining; String; Ordered tree; DNA; SEQUENTIAL PATTERNS; EFFICIENT; REPEATS; IDENTIFICATION; ALGORITHMS; DISCOVERY; FAMILIES
DOI
10.1016/j.dam.2015.07.002
Chinese Library Classification number
O29 [Applied mathematics]
Discipline classification code
070104
Abstract
We consider a frequent approximate pattern mining problem, in which interspersed repetitive regions are extracted from a given string. That is, we enumerate substrings that frequently match substrings of a given string locally and optimally. For this problem, we propose a new algorithm, in which candidate patterns are generated without duplication using the suffix tree of a given string. We further define a k-gap-constrained setting, in which the number of gaps in the alignment between a pattern and an occurrence is limited to at most k. Under this setting, we present memory-efficient algorithms, particularly a candidate-based version, which runs fast enough even over human chromosome sequences with more than 10 million nucleotides. We note that our problem and algorithms for strings can be directly extended to ordered labeled trees. In our experiments, we used both randomly synthesized strings, in which corrupted similar substrings are embedded, and real human chromosome data. The synthetic data experiments show that our proposed approach extracted embedded patterns correctly and time-efficiently. In the real data experiments, we examined the centers of 100 clusters computed after grouping the patterns obtained by our k-gap-constrained versions (k = 0, 1 and 2), and the results revealed that the regions of their occurrences coincided with about half of the regions automatically annotated as Alu sequences by a manually curated repeat sequence database. (C) 2015 Elsevier B.V. All rights reserved.
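As a rough illustration of what matching a pattern "locally and optimally" against a string can look like, the following sketch scores a pattern against a text with a generic Smith-Waterman-style local alignment and counts the end positions where the score peaks. It is not the suffix-tree-based algorithm of the paper, and it does not reproduce the paper's exact definition of a locally optimal occurrence or the k-gap constraint. The function names `best_local_scores` and `count_peak_occurrences`, the scoring values (match +1, mismatch -1, gap -1), and the peak criterion are all assumptions made for this example.

```python
def best_local_scores(pattern, text, match=1, mismatch=-1, gap=-1):
    """For each end position in `text`, return the best local alignment
    score of any part of `pattern` ending there (Smith-Waterman recurrence).
    Scoring parameters are illustrative assumptions, not the paper's."""
    prev = [0] * (len(text) + 1)      # DP row for the previous pattern prefix
    best = [0] * len(text)            # best score seen per text end position
    for i in range(1, len(pattern) + 1):
        curr = [0] * (len(text) + 1)
        for j in range(1, len(text) + 1):
            sub = match if pattern[i - 1] == text[j - 1] else mismatch
            curr[j] = max(
                0,                    # start a new local alignment
                prev[j - 1] + sub,    # align pattern[i-1] with text[j-1]
                prev[j] + gap,        # gap in the text
                curr[j - 1] + gap,    # gap in the pattern
            )
            best[j - 1] = max(best[j - 1], curr[j])
        prev = curr
    return best


def count_peak_occurrences(pattern, text, min_score):
    """Count end positions whose score is at least `min_score` and strictly
    higher than both neighbours (a hypothetical stand-in for 'locally
    optimal'; the paper's definition may differ)."""
    scores = best_local_scores(pattern, text)
    count = 0
    for j, s in enumerate(scores):
        left = scores[j - 1] if j > 0 else -1
        right = scores[j + 1] if j + 1 < len(scores) else -1
        if s >= min_score and s > left and s > right:
            count += 1
    return count


if __name__ == "__main__":
    text = "TTACGTTTACGTT"
    print(best_local_scores("ACGT", text))
    print(count_peak_occurrences("ACGT", text, min_score=4))
```

A frequent-pattern miner in this spirit would keep a candidate pattern only if such a count reaches a chosen minimum support. Lowering `min_score` below the pattern length lets corrupted copies of the pattern count as occurrences as well.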
Pages: 123-152
Number of pages: 30