Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data

Author
Takeaki Uno
Affiliation
[1] National Institute of Informatics,
Keywords
Neighbor search; Neighbor graph construction; Similarity analysis; Data analysis; Large scale data; Homology search;
Abstract
Finding similar substrings or substructures is a central task in analyzing huge string data such as genome sequences, Web documents, log data, and feature vectors of pictures, photos, and videos. Polynomial-time algorithms for such problems exist trivially, since the number of substrings of a string is bounded by the square of its length, but straightforward algorithms do not work for huge databases because of the high polynomial order of their computation time. This paper addresses the problem of finding pairs of strings with small Hamming distance in huge databases composed of short strings of a fixed length. Long strings can be compared by inputting all their fixed-length substrings, which yields candidates for similar longer substrings. We focus on the practical efficiency of algorithms and propose an algorithm that runs in time almost linear in the input/output size. We prove that the computation time of a variant of the algorithm is linear in the database size when the length of the short strings is constant, and computational experiments on genome sequences and Web texts show its practical efficiency. With slight modifications, the algorithm adapts to edit distance and to mismatch-tolerance computation. An implementation is available at the author's homepage.
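The problem stated in the abstract (report all pairs of fixed-length strings within a small Hamming distance) can be illustrated with a minimal sketch. It is not the author's implementation: it uses a generic pigeonhole block filter, in which each string is cut into k blocks and two strings at distance at most d must agree exactly on at least k - d of them. The names `similar_pairs`, `windows`, and `hamming`, and the default choice of k, are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def windows(text: str, length: int) -> list:
    """All substrings of `text` with exactly `length` characters.

    Hypothetical helper for reducing a long string to fixed-length records.
    """
    return [text[i:i + length] for i in range(len(text) - length + 1)]


def similar_pairs(strings, d, k=None):
    """Return the set of index pairs (i, j), i < j, with Hamming distance <= d.

    Assumption: every string in `strings` has the same length.
    """
    if not strings:
        return set()
    length = len(strings[0])
    if k is None:
        # Illustrative default only; any k >= 1 gives a correct result.
        k = max(1, min(length, d + 2))
    # Block b covers the half-open character range [bounds[b], bounds[b + 1]).
    bounds = [length * b // k for b in range(k + 1)]
    keep = max(k - d, 0)  # number of blocks that must match exactly
    found = set()
    for blocks in combinations(range(k), keep):
        # Group strings that agree on every chosen block.
        buckets = defaultdict(list)
        for idx, s in enumerate(strings):
            key = tuple(s[bounds[b]:bounds[b + 1]] for b in blocks)
            buckets[key].append(idx)
        # Only strings sharing a bucket are candidates; verify each one.
        for group in buckets.values():
            for i, j in combinations(group, 2):
                if (i, j) not in found and hamming(strings[i], strings[j]) <= d:
                    found.add((i, j))
    return found


if __name__ == "__main__":
    data = ["ACGTACGT", "ACGTACGA", "TTTTTTTT", "ACGAACGA", "TTTTTTTA"]
    print(sorted(similar_pairs(data, d=1)))
```

Grouping by exact block content means the explicit distance check runs only inside buckets, which is where the saving over comparing all pairs comes from. In this sketch the grouping uses a hash table; sorting the records by the chosen blocks would serve the same purpose.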
Pages: 229-251
Page count: 22