A survey and evaluations of histogram-based statistics in alignment-free sequence comparison

Cited by: 29
Authors
Luczak, Brian B. [1]
James, Benjamin T. [2]
Girgis, Hani Z. [3, 4, 5, 6]
Affiliations
[1] Univ Tulsa TU, Tulsa, OK USA
[2] TU, Tulsa, OK USA
[3] TU, Comp Sci, Tulsa, OK USA
[4] NIH, Bldg 10, Bethesda, MD 20892 USA
[5] Johns Hopkins Univ, Baltimore, MD 21218 USA
[6] SUNY Buffalo, Buffalo, NY USA
Keywords
alignment-free k-mer statistics; DNA sequence comparison; k-mer histograms; paired statistics; SIMILARITY; DISTANCE; SEARCH; MODELS; GENOME;
DOI
10.1093/bib/bbx161
Chinese Library Classification
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
Motivation: Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences.

Results: We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover's distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover's distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours.

Availability: The source code of the benchmarking tool is available as Supplementary Materials.
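
To make the idea of statistics "calculated on two k-mer histograms" concrete, the following is a minimal Python sketch, not the authors' benchmarking tool. The function names (`kmer_histogram`, `manhattan`, `euclidean`, `length_difference`, `paired`), the example sequences, and the choice of Manhattan and Euclidean distance as the two single statistics are all assumptions made for illustration; the abstract does not say which of the 33 statistics were paired.

```python
from collections import Counter
from math import sqrt


def kmer_histogram(seq, k):
    """Count every overlapping k-mer (word of length k) in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def manhattan(h1, h2):
    """Sum of absolute count differences over all k-mers seen in either histogram."""
    keys = h1.keys() | h2.keys()
    return sum(abs(h1[key] - h2[key]) for key in keys)


def euclidean(h1, h2):
    """Square root of the summed squared count differences."""
    keys = h1.keys() | h2.keys()
    return sqrt(sum((h1[key] - h2[key]) ** 2 for key in keys))


def length_difference(s1, s2):
    """Absolute difference between the two sequence lengths."""
    return abs(len(s1) - len(s2))


def paired(stat_a, stat_b):
    """Hypothetical multiplicative combination of two single statistics."""
    return stat_a * stat_b


# Illustrative sequences only; they are not taken from the paper.
query = "ACGTACGTAC"
candidate = "ACGTTCGTAC"
k = 3

hq = kmer_histogram(query, k)
hc = kmer_histogram(candidate, k)

print("Manhattan:", manhattan(hq, hc))
print("Euclidean:", euclidean(hq, hc))
print("Length difference:", length_difference(query, candidate))
print("Paired (Manhattan x Euclidean):", paired(manhattan(hq, hc), euclidean(hq, hc)))
```

Each histogram is built in a single pass over its sequence, which is why such statistics can serve as a fast filter ahead of quadratic alignment. A smaller `k` shrinks the histogram, in line with the abstract's remark that shorter words reduce memory use and increase speed.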
Pages: 1222-1237
Page count: 16