A survey and evaluations of histogram-based statistics in alignment-free sequence comparison

Cited by: 29

Authors:

Luczak, Brian B. [1]

James, Benjamin T. [2]

Girgis, Hani Z. [3,4,5,6]

Affiliations:

[1] Univ Tulsa TU, Tulsa, OK USA

[2] TU, Tulsa, OK USA

[3] TU, Comp Sci, Tulsa, OK USA

[4] NIH, Bldg 10, Bethesda, MD 20892 USA

[5] Johns Hopkins Univ, Baltimore, MD 21218 USA

[6] SUNY Buffalo, Buffalo, NY USA

Source:

BRIEFINGS IN BIOINFORMATICS | 2019 / Volume 20 / Issue 04

Keywords:

alignment-free k-mer statistics; DNA sequence comparison; k-mer histograms; paired statistics; SIMILARITY; DISTANCE; SEARCH; MODELS; GENOME;

DOI:

10.1093/bib/bbx161

Chinese Library Classification number:

Q5 [Biochemistry];

Discipline classification code:

071010 ; 081704 ;

Abstract:

Motivation: Since the dawn of the bioinformatics field, sequence alignment scores have been the main method for comparing sequences. However, alignment algorithms are quadratic, requiring long execution time. As alternatives, scientists have developed tens of alignment-free statistics for measuring the similarity between two sequences.

Results: We surveyed tens of alignment-free k-mer statistics. Additionally, we evaluated 33 statistics and multiplicative combinations between the statistics and/or their squares. These statistics are calculated on two k-mer histograms representing two sequences. Our evaluations using global alignment scores revealed that the majority of the statistics are sensitive and capable of finding similar sequences to a query sequence. Therefore, any of these statistics can filter out dissimilar sequences quickly. Further, we observed that multiplicative combinations of the statistics are highly correlated with the identity score. Furthermore, combinations involving sequence length difference or Earth Mover's distance, which takes the length difference into account, are always among the highest correlated paired statistics with identity scores. Similarly, paired statistics including length difference or Earth Mover's distance are among the best performers in finding the K-closest sequences. Interestingly, similar performance can be obtained using histograms of shorter words, resulting in reducing the memory requirement and increasing the speed remarkably. Moreover, we found that simple single statistics are sufficient for processing next-generation sequencing reads and for applications relying on local alignment. Finally, we measured the time requirement of each statistic. The survey and the evaluations will help scientists with identifying efficient alternatives to the costly alignment algorithm, saving thousands of computational hours.

Availability: The source code of the benchmarking tool is available as Supplementary Materials.
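To make the abstract's central objects concrete, the following is a minimal sketch of a k-mer histogram and of one statistic computed on two such histograms. It is not the authors' benchmarking tool. The function names are hypothetical, Manhattan distance stands in for any of the surveyed statistics, and the `(1 + length difference)` factor is an assumed way to form a multiplicative combination with length difference; the abstract does not give the exact formula.

```python
from collections import Counter
from itertools import product


def kmer_histogram(seq, k):
    """Count every overlapping k-mer of seq over the alphabet ACGT."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # Fix the order of the 4**k possible words so two histograms are comparable.
    return [counts.get("".join(w), 0) for w in product("ACGT", repeat=k)]


def manhattan(h1, h2):
    """Sum of absolute differences between two equal-length histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))


def paired_statistic(s1, s2, k):
    """Assumed combination: Manhattan distance times (1 + length difference).

    The +1 is an illustrative choice that keeps the product from collapsing
    to zero when the two sequences have equal length.
    """
    d = manhattan(kmer_histogram(s1, k), kmer_histogram(s2, k))
    return d * (1 + abs(len(s1) - len(s2)))
```

Each histogram has 4**k entries, so a shorter word length k shrinks it exponentially, which is consistent with the abstract's remark that shorter words reduce memory use.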


Pages: 1222-1237

Page count: 16
