A survey and evaluations of histogram-based statistics in alignment-free sequence comparison

Cited by: 29
Authors
Luczak, Brian B. [1]
James, Benjamin T. [2]
Girgis, Hani Z. [3, 4, 5, 6]
Affiliations
[1] Univ Tulsa TU, Tulsa, OK USA
[2] TU, Tulsa, OK USA
[3] TU, Comp Sci, Tulsa, OK USA
[4] NIH, Bldg 10, Bethesda, MD 20892 USA
[5] Johns Hopkins Univ, Baltimore, MD 21218 USA
[6] SUNY Buffalo, Buffalo, NY USA
Keywords
alignment-free k-mer statistics; DNA sequence comparison; k-mer histograms; paired statistics; SIMILARITY; DISTANCE; SEARCH; MODELS; GENOME;
DOI
10.1093/bib/bbx161
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Motivation: Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences.
Results: We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover's distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover's distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours.
Availability: The source code of the benchmarking tool is available as Supplementary Materials.
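As a rough illustration of what "statistics calculated on two k-mer histograms" can look like, the following Python sketch builds a k-mer histogram for each sequence, computes a single statistic (Manhattan distance), and multiplies it by a length-difference term to form a paired statistic. This is not the authors' benchmarking tool: the function names, the choice of Manhattan distance, the default k, and the `1 + length difference` factor (used here only so that equal-length sequences do not force the product to zero) are all assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_histogram(seq, k):
    """Count overlapping k-mers; words containing symbols outside ACGT are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed ordering of all 4**k words so that two histograms are comparable.
    words = ("".join(p) for p in product("ACGT", repeat=k))
    return [counts.get(w, 0) for w in words]


def manhattan(h1, h2):
    """Sum of absolute differences between two histograms of equal length."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def paired_statistic(seq1, seq2, k=3):
    """Hypothetical multiplicative combination (an assumption, not the paper's
    definition): Manhattan distance times (1 + absolute length difference)."""
    h1, h2 = kmer_histogram(seq1, k), kmer_histogram(seq2, k)
    length_diff = abs(len(seq1) - len(seq2))
    return manhattan(h1, h2) * (1 + length_diff)


if __name__ == "__main__":
    query = "ACGTACGTACGT"
    candidates = ["ACGTACGTACGA", "TTTTTTTTTTTT", "ACGTACG"]
    # Smaller values mean "more similar" under this distance-like statistic.
    for cand in sorted(candidates, key=lambda s: paired_statistic(query, s)):
        print(cand, paired_statistic(query, cand))
```

Ranking candidates by such a statistic is one way to filter out dissimilar sequences before running a costly alignment; a smaller k shrinks the histogram (4^k entries), which is the memory and speed trade-off the abstract mentions.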
Pages: 1222 - 1237
Page count: 16
Related papers
50 in total
  • [41] Protein map: An alignment-free sequence comparison method based on various properties of amino acids
    Yu, Chenglong
    Cheng, Shiu-Yuen
    He, Rong L.
    Yau, Stephen S. -T.
    GENE, 2011, 486 (1-2) : 110 - 118
  • [42] Alignment-free viral sequence classification at scale
    Daniel J. van Zyl
    Marcel Dunaiski
    Houriiyah Tegally
    Cheryl Baxter
    Tulio de Oliveira
    Joicymara S. Xavier
    BMC Genomics, 26 (1)
  • [43] CAFE: aCcelerated Alignment-FrEe sequence analysis
    Lu, Yang Young
    Tang, Kujin
    Ren, Jie
    Fuhrman, Jed A.
    Waterman, Michael S.
    Sun, Fengzhu
    NUCLEIC ACIDS RESEARCH, 2017, 45 (W1) : W554 - W559
  • [44] An effective extension of the applicability of alignment-free biological sequence comparison algorithms with Hadoop
    Cattaneo, Giuseppe
    Petrillo, Umberto Ferraro
    Giancarlo, Raffaele
    Roscigno, Gianluca
    JOURNAL OF SUPERCOMPUTING, 2017, 73 (04) : 1467 - 1483
  • [45] Interpreting alignment-free sequence comparison: what makes a score a good score?
    Swain, Martin T.
    Vickers, Martin
    NAR GENOMICS AND BIOINFORMATICS, 2022, 4 (03)
  • [46] Extraction of high quality k-words for alignment-free sequence comparison
    Gunasinghe, Upuli
    Alahakoon, Damminda
    Bedingfield, Susan
    JOURNAL OF THEORETICAL BIOLOGY, 2014, 358 : 31 - 51
  • [47] Fast alignment-free sequence comparison using spaced-word frequencies
    Leimeister, Chris-Andre
    Boden, Marcus
    Horwege, Sebastian
    Lindner, Sebastian
    Morgenstern, Burkhard
    BIOINFORMATICS, 2014, 30 (14) : 1991 - 1999
  • [48] Alignment-Free Sequence Comparison Using N-Dimensional Similarity Space
    Jayalakshmi, Ramamurthy
    Natarajan, Ramanathan
    Vivekanandan, Munusamy
    Natarajan, Ganapathy S.
    CURRENT COMPUTER-AIDED DRUG DESIGN, 2010, 6 (04) : 290 - 296
  • [49] An effective extension of the applicability of alignment-free biological sequence comparison algorithms with Hadoop
    Giuseppe Cattaneo
    Umberto Ferraro Petrillo
    Raffaele Giancarlo
    Gianluca Roscigno
    The Journal of Supercomputing, 2017, 73 : 1467 - 1483
  • [50] Numerical Characterization of DNA Sequences for Alignment-free Sequence Comparison-A Review
    Ramanathan, Natarajan
    Ramamurthy, Jayalakshmi
    Natarajan, Ganapathy
    COMBINATORIAL CHEMISTRY & HIGH THROUGHPUT SCREENING, 2022, 25 (03) : 365 - 380