A survey and evaluations of histogram-based statistics in alignment-free sequence comparison

Cited by: 29

Authors:

Luczak, Brian B. [1]

James, Benjamin T. [2]

Girgis, Hani Z. [3,4,5,6]

Affiliations:

[1] Univ Tulsa TU, Tulsa, OK USA

[2] TU, Tulsa, OK USA

[3] TU, Comp Sci, Tulsa, OK USA

[4] NIH, Bldg 10, Bethesda, MD 20892 USA

[5] Johns Hopkins Univ, Baltimore, MD 21218 USA

[6] SUNY Buffalo, Buffalo, NY USA

Source:

BRIEFINGS IN BIOINFORMATICS | 2019 / Volume 20 / Issue 04

Keywords:

alignment-free k-mer statistics; DNA sequence comparison; k-mer histograms; paired statistics; SIMILARITY; DISTANCE; SEARCH; MODELS; GENOME;

DOI:

10.1093/bib/bbx161

Chinese Library Classification number:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Motivation: Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences.

Results: We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover's distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover's distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours.

Availability: The source code of the benchmarking tool is available as Supplementary Materials.
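To make the abstract's central idea concrete, the following is a minimal Python sketch of computing a statistic on two k-mer histograms and multiplying it with the sequence length difference. It is not the authors' benchmarking tool. The function names, the choice of Manhattan distance as the single statistic, and the `+ 1` offset on the length difference are all illustrative assumptions.

```python
from collections import Counter
from itertools import product


def kmer_histogram(seq, k):
    """Count every overlapping k-mer of seq over the alphabet ACGT."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fixed k-mer ordering so two histograms can be compared entry by entry.
    return [counts.get("".join(p), 0) for p in product("ACGT", repeat=k)]


def manhattan(h1, h2):
    """Sum of absolute count differences between two histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def length_difference(s1, s2):
    """Absolute difference between the two sequence lengths."""
    return abs(len(s1) - len(s2))


def paired_statistic(s1, s2, k=2):
    """Illustrative multiplicative combination of two statistics.

    Assumption: the + 1 offset is added here only so that equal-length
    sequences do not force the product to zero; it is not from the paper.
    """
    h1, h2 = kmer_histogram(s1, k), kmer_histogram(s2, k)
    return manhattan(h1, h2) * (length_difference(s1, s2) + 1)
```

Because a histogram over ACGT has 4^k entries, a smaller k shrinks the histogram and speeds up the comparison, which is the trade-off the abstract refers to when it mentions shorter words.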


Pages: 1222-1237

Page count: 16
