A survey and evaluations of histogram-based statistics in alignment-free sequence comparison

Cited by: 29
Authors
Luczak, Brian B. [1]
James, Benjamin T. [2]
Girgis, Hani Z. [3, 4, 5, 6]
Affiliations
[1] Univ Tulsa TU, Tulsa, OK USA
[2] TU, Tulsa, OK USA
[3] TU, Comp Sci, Tulsa, OK USA
[4] NIH, Bldg 10, Bethesda, MD 20892 USA
[5] Johns Hopkins Univ, Baltimore, MD 21218 USA
[6] SUNY Buffalo, Buffalo, NY USA
Keywords
alignment-free k-mer statistics; DNA sequence comparison; k-mer histograms; paired statistics; SIMILARITY; DISTANCE; SEARCH; MODELS; GENOME;
DOI
10.1093/bib/bbx161
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Motivation: Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences.
Results: We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover's distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover's distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours.
Availability: The source code of the benchmarking tool is available as Supplementary Materials.
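The abstract describes statistics computed on two k-mer histograms and multiplicative pairings such as a statistic times the sequence length difference. The following is a minimal Python sketch of that general idea, not the authors' benchmarking tool. All function names are hypothetical, the DNA alphabet is assumed to be ACGT, Manhattan distance is chosen only as a simple example statistic, and the cumulative-difference function is one common one-dimensional formulation associated with Earth Mover's distance; whether the paper defines its statistics this way is an assumption.

```python
from itertools import product


def kmer_histogram(seq, k):
    """Count every overlapping k-mer of seq over the alphabet ACGT (assumed)."""
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    hist = [0] * len(index)
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:  # skip k-mers containing N or other symbols
            hist[j] += 1
    return hist


def manhattan(h1, h2):
    """Sum of absolute count differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def cumulative_difference(h1, h2):
    """Sum of absolute differences of running totals (a 1-D EMD-style form).

    The histograms are left unnormalized here, so a difference in total
    counts (and hence in sequence length) also contributes to the value.
    """
    total, running = 0, 0
    for a, b in zip(h1, h2):
        running += a - b
        total += abs(running)
    return total


def paired(stat_a, stat_b):
    """Multiplicative combination of two single statistics."""
    return stat_a * stat_b


# Hypothetical toy sequences, for illustration only.
x = "ACGTACGTAC"
y = "ACGTTCGTACGG"
hx, hy = kmer_histogram(x, 2), kmer_histogram(y, 2)
length_diff = abs(len(x) - len(y))
print(manhattan(hx, hy), cumulative_difference(hx, hy),
      paired(manhattan(hx, hy), length_diff))
```

A histogram over k-mers has 4^k bins, so shorter words shrink the memory footprint quickly, which is consistent with the abstract's remark about histograms of shorter words.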
Pages: 1222-1237
Page count: 16
Related papers
50 in total
  • [1] Alignment-Free Sequence Comparison (I): Statistics and Power
    Reinert, Gesine
    Chew, David
    Sun, Fengzhu
    Waterman, Michael S.
    JOURNAL OF COMPUTATIONAL BIOLOGY, 2009, 16 (12) : 1615 - 1634
  • [2] Alignment-Free Sequence Comparison (II): Theoretical Power of Comparison Statistics
    Wan, Lin
    Reinert, Gesine
    Sun, Fengzhu
    Waterman, Michael S.
    JOURNAL OF COMPUTATIONAL BIOLOGY, 2010, 17 (11) : 1467 - +
  • [3] Alignment-free sequence comparison - a review
    Vinga, S
    Almeida, J
    BIOINFORMATICS, 2003, 19 (04) : 513 - 523
  • [4] Multiple alignment-free sequence comparison
    Ren, Jie
    Song, Kai
    Sun, Fengzhu
    Deng, Minghua
    Reinert, Gesine
    BIOINFORMATICS, 2013, 29 (21) : 2690 - 2698
  • [5] Benchmarking of alignment-free sequence comparison methods
    Zielezinski, Andrzej
    Girgis, Hani Z.
    Bernard, Guillaume
    Leimeister, Chris-Andre
    Tang, Kujin
    Dencker, Thomas
    Lau, Anna Katharina
    Roehling, Sophie
    Choi, Jae Jin
    Waterman, Michael S.
    Comin, Matteo
    Kim, Sung-Hou
    Vinga, Susana
    Almeida, Jonas S.
    Chan, Cheong Xin
    James, Benjamin T.
    Sun, Fengzhu
    Morgenstern, Burkhard
    Karlowski, Wojciech M.
    GENOME BIOLOGY, 2019, 20 (1)
  • [6] A probabilistic measure for alignment-free sequence comparison
    Pham, TD
    Zuegg, J
    BIOINFORMATICS, 2004, 20 (18) : 3455 - 3461
  • [7] Benchmarking of alignment-free sequence comparison methods
    Andrzej Zielezinski
    Hani Z. Girgis
    Guillaume Bernard
    Chris-Andre Leimeister
    Kujin Tang
    Thomas Dencker
    Anna Katharina Lau
    Sophie Röhling
    Jae Jin Choi
    Michael S. Waterman
    Matteo Comin
    Sung-Hou Kim
    Susana Vinga
    Jonas S. Almeida
    Cheong Xin Chan
    Benjamin T. James
    Fengzhu Sun
    Burkhard Morgenstern
    Wojciech M. Karlowski
    Genome Biology, 20
  • [8] New powerful statistics for alignment-free sequence comparison under a pattern transfer model
    Liu, Xuemei
    Wan, Lin
    Li, Jing
    Reinert, Gesine
    Waterman, Michael S.
    Sun, Fengzhu
    JOURNAL OF THEORETICAL BIOLOGY, 2011, 284 (01) : 106 - 116
  • [9] Alignment-Free Sequence Comparison: A Systematic Survey From a Machine Learning Perspective
    Bohnsack, Katrin Sophie
    Kaden, Marika
    Abel, Julia
    Villmann, Thomas
    IEEE-ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, 2023, 20 (01) : 119 - 135
  • [10] Weighted measures based on maximizing deviation for alignment-free sequence comparison
    Qian, Kun
    Luan, Yihui
    PHYSICA A-STATISTICAL MECHANICS AND ITS APPLICATIONS, 2017, 481 : 235 - 242