A survey and evaluations of histogram-based statistics in alignment-free sequence comparison

Cited by: 29

Authors:

Luczak, Brian B. ^{[1]}

James, Benjamin T. ^{[2]}

Girgis, Hani Z. ^{[3,4,5,6]}

Affiliations:

[1] Univ Tulsa TU, Tulsa, OK USA

[2] TU, Tulsa, OK USA

[3] TU, Comp Sci, Tulsa, OK USA

[4] NIH, Bldg 10, Bethesda, MD 20892 USA

[5] Johns Hopkins Univ, Baltimore, MD 21218 USA

[6] SUNY Buffalo, Buffalo, NY USA

Source:

BRIEFINGS IN BIOINFORMATICS | 2019 / Volume 20 / Issue 04

Keywords:

alignment-free k-mer statistics; DNA sequence comparison; k-mer histograms; paired statistics; SIMILARITY; DISTANCE; SEARCH; MODELS; GENOME;

DOI:

10.1093/bib/bbx161

Chinese Library Classification (CLC) number:

Q5 [Biochemistry];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Motivation: Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences. Results: We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover's distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover's distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours. Availability: The source code of the benchmarking tool is available as Supplementary Materials.
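
The abstract describes statistics computed on two k-mer histograms and multiplicative combinations of such statistics, including sequence length difference. The following is a minimal Python sketch of that idea, not the authors' benchmarking tool. The function names, the choice of Manhattan distance as the single statistic, and the specific product used as the paired statistic are all illustrative assumptions.

```python
from collections import Counter
from itertools import product


def kmer_histogram(seq, k):
    """Count overlapping k-mers of seq over the alphabet ACGT.

    Returns a list of 4**k counts in lexicographic k-mer order.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts.get("".join(p), 0) for p in product("ACGT", repeat=k)]


def manhattan(h1, h2):
    """Manhattan (L1) distance between two histograms of equal size."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def length_difference(s1, s2):
    """Absolute difference between the two sequence lengths."""
    return abs(len(s1) - len(s2))


def paired_statistic(s1, s2, k):
    """Illustrative multiplicative combination (an assumption, not the
    paper's definition): Manhattan distance times length difference."""
    h1 = kmer_histogram(s1, k)
    h2 = kmer_histogram(s2, k)
    return manhattan(h1, h2) * length_difference(s1, s2)
```

For instance, `kmer_histogram("ACGTAC", 2)` has 16 bins whose counts sum to 5, the number of overlapping 2-mers in a sequence of length 6. Shorter words give smaller histograms (4**k bins), which is consistent with the abstract's remark that shorter words reduce memory use.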


Pages: 1222-1237

Page count: 16
