Space-Efficient String Mining under Frequency Constraints

Cited by: 12
Authors
Fischer, Johannes [1]
Makinen, Veli [2]
Valimaki, Niko [2]
Affiliations
[1] Univ Tubingen, Ctr Bioinformat ZBIT, Sand 14, D-72076 Tubingen, Germany
[2] Univ Helsinki, Dept Comp Sci, Helsinki, Finland
Funding
Academy of Finland
DOI
10.1109/ICDM.2008.32
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Let D_1 and D_2 be two databases (i.e., multisets) of d strings over an alphabet Σ, with overall length n. We study the problem of mining discriminative patterns between D_1 and D_2, e.g., patterns that are frequent in one database but not in the other (emerging patterns), or patterns satisfying other frequency-related constraints. Using the algorithmic framework by Hui (CPM 1992), one can solve several variants of this problem in optimal linear time with the aid of suffix trees or suffix arrays. This stands in sharp contrast to other pattern domains such as itemsets or subgraphs, where super-linear lower bounds are known. However, the space requirement of existing solutions is O(n log n) bits, which is not optimal for |Σ| ≪ n (in particular for constant |Σ|), as the databases themselves occupy only n log |Σ| bits. Because in many real-life applications space is a more critical resource than time, the aim of this article is to reduce the space at the cost of an increased running time. In particular, we give a solution for the above problems that uses O(n log |Σ| + d log n) bits, while the time requirement increases from the optimal linear time to O(n log n). Our new method is tested extensively on biologically relevant datasets and shown to be usable even on genome-scale data.
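To make the problem statement concrete, here is a minimal brute-force sketch in Python of mining patterns under frequency constraints. It is not the method of the paper, which relies on suffix-array-style indexes to reach the stated space and time bounds. This sketch enumerates every substring explicitly and is only practical for tiny inputs. The function names and the two threshold parameters (`min_freq_d1`, `max_freq_d2`) are hypothetical and chosen for illustration; "frequency" is assumed to mean the number of strings in a database that contain the pattern.

```python
from collections import Counter


def document_frequencies(database):
    """Map each substring to the number of strings in `database` containing it."""
    freq = Counter()
    for s in database:
        # Collect substrings in a set so each string counts at most once per pattern.
        substrings = {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}
        freq.update(substrings)
    return freq


def mine_discriminative(d1, d2, min_freq_d1, max_freq_d2):
    """Return patterns occurring in at least `min_freq_d1` strings of d1
    and in at most `max_freq_d2` strings of d2 (hypothetical thresholds)."""
    f1 = document_frequencies(d1)
    f2 = document_frequencies(d2)
    # Counter returns 0 for patterns that never occur in d2.
    return sorted(p for p, count in f1.items()
                  if count >= min_freq_d1 and f2[p] <= max_freq_d2)


if __name__ == "__main__":
    d1 = ["abab", "abc"]
    d2 = ["bcd", "cc"]
    # Patterns present in both strings of d1 and absent from d2.
    print(mine_discriminative(d1, d2, min_freq_d1=2, max_freq_d2=0))
```

The brute-force enumeration stores a number of substrings quadratic in the string length, which is exactly the kind of space cost that index-based approaches are designed to avoid.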
Pages: 193 / +
Page count: 2
Related papers
50 in total
  • [1] Efficient string mining under constraints via the deferred frequency index
    Weese, David
    Schulz, Marcel H.
    ADVANCES IN DATA MINING, PROCEEDINGS: MEDICAL APPLICATIONS, E-COMMERCE, MARKETING, AND THEORETICAL ASPECTS, 2008, 5077 : 374 - +
  • [2] Optimal string mining under frequency constraints
    Fischer, Johannes
    Heun, Volker
    Kramer, Stefan
    KNOWLEDGE DISCOVERY IN DATABASES: PKDD 2006, PROCEEDINGS, 2006, 4213 : 139 - 150
  • [3] A Framework for Space-Efficient String Kernels
    Djamal Belazzougui
    Fabio Cunial
    Algorithmica, 2017, 79 : 857 - 883
  • [4] A Framework for Space-Efficient String Kernels
    Belazzougui, Djamal
    Cunial, Fabio
    ALGORITHMICA, 2017, 79 (03) : 857 - 883
  • [5] An efficient algorithm for mining string databases under constraints
    Lee, SD
    De Raedt, L
    KNOWLEDGE DISCOVERY IN INDUCTIVE DATABASES, 2005, 3377 : 108 - 129
  • [6] Space-efficient multiple string matching automata
    Zhang, M. (zhangmeng@jlu.edu.cn), 1600, Inderscience Publishers (05):
  • [7] Space-efficient acyclicity constraints: A declarative pearl
    Brock-Nannestad, Taus
    SCIENCE OF COMPUTER PROGRAMMING, 2018, 164 : 66 - 81
  • [8] HashTrie: A space-efficient multiple string matching algorithm
    2015, Editorial Board of Journal on Communications (36):
  • [9] Fast String Matching with Space-efficient Word Graphs
    Yata, Susumu
    Morita, Kazuhiro
    Fuketa, Masao
    Aoe, Jun-ichi
    IIT: 2008 INTERNATIONAL CONFERENCE ON INNOVATIONS IN INFORMATION TECHNOLOGY, 2008, : 484 - 488
  • [10] Space-efficient computation of parallel approximate string matching
    Muhammad Umair Sadiq
    Muhammad Murtaza Yousaf
    The Journal of Supercomputing, 2023, 79 : 9093 - 9126