Multi-sorting algorithm for finding pairs of similar short substrings from large-scale string data

Author
Takeaki Uno
Affiliation
[1] National Institute of Informatics,
Source
Knowledge and Information Systems | 2010 / Vol. 25
Keywords
Neighbor search; Neighbor graph construction; Similarity analysis; Data analysis; Large scale data; Homology search;
DOI
Not available
Abstract
Finding similar substrings/substructures is a central task in analyzing huge string data such as genome sequences, Web documents, log data, and feature vectors of pictures, photos, and videos. Polynomial-time algorithms for such problems exist trivially, since the number of substrings is bounded by the square of the string length, but straightforward algorithms do not work for huge databases because their running time is a high-order polynomial. This paper addresses the problem of finding pairs of strings with small Hamming distances in huge databases composed of short strings of a fixed length. Long strings can be compared by inputting all of their fixed-length substrings, which yields candidates for similar longer substrings. We focus on the practical efficiency of algorithms and propose an algorithm that runs in time almost linear in the input/output size. We prove that the computation time of a variant of the algorithm is linear in the database size when the length of the short strings is constant, and computational experiments on genome sequences and Web texts show its practical efficiency. With slight modifications, the algorithm adapts to edit distance and to mismatch tolerance computation. An implementation is available at the author’s homepage.
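The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common way to find pairs of fixed-length strings within a small Hamming distance, and it should not be read as the author's implementation. It assumes a block-partition (pigeonhole) idea: split each string into k blocks with k > d; two strings that differ in at most d positions must then agree exactly on at least k - d blocks, so only strings that share a key built from some choice of k - d blocks need to be compared. The function names and the parameters `d` and `k` are illustrative, and dictionary grouping is used here in place of sorting.

```python
from collections import defaultdict
from itertools import combinations


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def fixed_length_substrings(text: str, length: int) -> list[str]:
    """All substrings of the given length, ordered by starting position."""
    return [text[i:i + length] for i in range(len(text) - length + 1)]


def similar_pairs(strings: list[str], d: int, k: int) -> set[tuple[int, int]]:
    """Index pairs (i, j), i < j, whose strings differ in at most d positions.

    Assumption (not from the source): all strings share one length L and
    0 <= d < k <= L, so every block is non-empty.
    """
    if not strings:
        return set()
    length = len(strings[0])
    if any(len(s) != length for s in strings):
        raise ValueError("all strings must have the same length")
    if not 0 <= d < k <= length:
        raise ValueError("need 0 <= d < k <= string length")

    # Half-open [start, end) boundaries of the k blocks.
    bounds = [(b * length // k, (b + 1) * length // k) for b in range(k)]

    result: set[tuple[int, int]] = set()
    # Try every choice of k - d blocks that a similar pair could agree on.
    for chosen in combinations(range(k), k - d):
        buckets: dict[tuple[str, ...], list[int]] = defaultdict(list)
        for idx, s in enumerate(strings):
            key = tuple(s[bounds[b][0]:bounds[b][1]] for b in chosen)
            buckets[key].append(idx)
        # Only strings sharing a key are verified with a full comparison.
        for group in buckets.values():
            for i, j in combinations(group, 2):
                if hamming(strings[i], strings[j]) <= d:
                    result.add((i, j))
    return result
```

To compare long strings in this sketch, one would first call `fixed_length_substrings` on each long string and pass the resulting short strings to `similar_pairs`; matching positions then serve as candidates for longer similar regions.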
Pages: 229-251
Page count: 22