An efficient algorithm for finding similar short substrings from large scale string data

Cited by: 0
Author
Uno, Takeaki [1]
Affiliation
[1] Natl Inst Informat, Chiyoda Ku, Tokyo 1018430, Japan
Source
ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PROCEEDINGS | 2008 / Vol. 5012
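Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes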
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Finding similar substrings/substructures is a central task in analyzing huge amounts of string data such as genome sequences, web documents, and log data. From the viewpoint of complexity theory, the existence of polynomial-time algorithms for such problems is usually trivial, since the number of substrings of a string is bounded by the square of its length. However, straightforward algorithms do not work on huge practical databases because their running time is a high-degree polynomial. This paper addresses the problem of finding pairs of strings with small Hamming distance in huge databases composed of short strings. By solving the problem for all substrings of a fixed length, we can efficiently find candidates for similar longer substrings. We focus on the practical efficiency of algorithms and propose an algorithm that runs in time almost linear in the database size. We prove that the computation time of a variant of the algorithm is linear in the database size when the length of the short strings to be found is constant. Slight modifications adapt the algorithm to edit distance and to mismatch tolerance computation. Computational experiments on genome sequences show the efficiency of the algorithm. An implementation is available at the author's homepage (1).
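To make the problem statement in the abstract concrete, the following is a minimal Python sketch of one common way to find pairs of equal-length strings within Hamming distance d. It is not taken from the paper and should not be read as the author's algorithm: it uses the standard pigeonhole idea of splitting each string into d + 1 blocks, so that any two strings with at most d mismatches must agree exactly on at least one block. The function names `hamming`, `similar_pairs`, and `fixed_length_substrings` are hypothetical names chosen for this illustration.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def similar_pairs(strings, d):
    """Return the set of index pairs (i, j), i < j, with Hamming distance <= d.

    Assumption of this sketch: all strings have the same length.
    Illustrative pigeonhole filter, not the paper's algorithm.
    """
    if not strings:
        return set()
    length = len(strings[0])
    k = d + 1  # with d + 1 blocks, at most d of them can contain a mismatch
    bounds = [length * b // k for b in range(k + 1)]
    found = set()
    for b in range(k):
        lo, hi = bounds[b], bounds[b + 1]
        # Group strings that agree exactly on block b.
        buckets = defaultdict(list)
        for i, s in enumerate(strings):
            buckets[s[lo:hi]].append(i)
        # Verify candidates that share a block with a full distance check.
        for group in buckets.values():
            for i, j in combinations(group, 2):
                if (i, j) not in found and hamming(strings[i], strings[j]) <= d:
                    found.add((i, j))
    return found


def fixed_length_substrings(text, length):
    """All substrings of the given length, in order of starting position."""
    return [text[p:p + length] for p in range(len(text) - length + 1)]
```

For example, `similar_pairs(fixed_length_substrings(genome, 20), 2)` would list pairs of starting positions whose length-20 windows differ in at most two positions, where `genome` is any string supplied by the caller. The sketch still compares every pair inside a bucket, so its worst case is quadratic; it only illustrates the problem and makes no claim to the near-linear behavior reported in the abstract.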
Pages: 345-356
Page count: 12