A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words

Cited by: 100
Authors
Wu, TJ [1]
Burke, JP
Davison, DB
Affiliations
[1] Natl Donghwa Univ, Dept Math Appl, Hualien, Taiwan
[2] Univ Houston, Dept Math, Houston, TX 77204 USA
[3] Univ Houston, Dept Biochem & Biophys Sci, Houston, TX 77204 USA
Keywords
DNA sequences; dissimilarity measures; Mahalanobis distance; standardized Euclidean distance
DOI
10.2307/2533509
Chinese Library Classification number
Q [Biological Sciences];
Discipline classification code
07 ; 0710 ; 09 ;
Abstract
A number of algorithms exist for searching genetic databases for biologically significant similarities in DNA sequences. Past research has shown that word-based search tools are computationally efficient and can find similarities or dissimilarities invisible to other algorithms like FASTA. We characterize a family of word-based dissimilarity measures that define distance between two sequences by simultaneously comparing the frequencies of all subsequences of n adjacent letters (i.e., n-words) in the two sequences. Applications to real data demonstrate that currently used word-based methods that rely on Euclidean distance can be significantly improved by using Mahalanobis distance, which accounts for both variances and covariances between frequencies of n-words. Furthermore, in those cases where Mahalanobis distance may be too difficult to compute, using standardized Euclidean distance, which only corrects for the variances of frequencies of n-words, still gives better performance than the Euclidean distance. Also, a simple way of combining distances obtained at different n-words is considered. The goal is to obtain a single measure of dissimilarity between two DNA sequences. The performance ranking of the preceding three distances still holds for their combined counterparts. All results obtained in this paper are applicable to amino acid sequences with minor modifications.
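The three distances named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of relative frequencies of overlapping n-words, and the caller-supplied `variances` and `precision` (an inverse or generalized inverse of the covariance matrix of n-word frequencies) are all assumptions made for illustration. The record does not say how the paper obtains these quantities.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumed DNA alphabet; other symbols (e.g. N) are ignored


def word_frequencies(seq, n):
    """Relative frequencies of all 4**n overlapping n-words, in lexicographic order."""
    seq = seq.upper()
    words = ["".join(p) for p in product(ALPHABET, repeat=n)]
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts[w] for w in words)
    return [counts[w] / total if total else 0.0 for w in words]


def euclidean(f, g):
    """Plain Euclidean distance between two frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))


def standardized_euclidean(f, g, variances):
    """Euclidean distance with each coordinate scaled by its variance.

    `variances` is a hypothetical caller-supplied list of positive values.
    """
    return sqrt(sum((a - b) ** 2 / v for a, b, v in zip(f, g, variances)))


def mahalanobis(f, g, precision):
    """Mahalanobis distance sqrt(d' P d), where d = f - g.

    `precision` is a hypothetical caller-supplied square matrix (list of lists)
    standing in for the inverse of the covariance matrix.
    """
    d = [a - b for a, b in zip(f, g)]
    k = len(d)
    q = sum(d[i] * precision[i][j] * d[j] for i in range(k) for j in range(k))
    return sqrt(q)
```

With unit variances, or an identity `precision` matrix, both weighted distances reduce to the plain Euclidean distance, which gives a quick sanity check for any variance or covariance estimates plugged in.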
Pages: 1431-1439
Number of pages: 9
Related papers
50 in total
  • [1] A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words
    Wu, T.-J.
    Burke, J. P.
    Davison, D. B.
    Biometrics, 53 (04):
  • [2] WSE, a new sequence distance measure based on word frequencies
    Wang, Jun
    Zheng, Xiaoqi
    MATHEMATICAL BIOSCIENCES, 2008, 215 (01) : 78 - 83
  • [3] A cardinal dissensus measure based on the Mahalanobis distance
    Gonzalez-Arteaga, T.
    Alcantud, J. C. R.
    de Andres Calle, R.
    EUROPEAN JOURNAL OF OPERATIONAL RESEARCH, 2016, 251 (02) : 575 - 585
  • [4] Numerical characteristics of word frequencies and their application to dissimilarity measure for sequence comparison
    Dai, Qi
    Liu, Xiaoqing
    Yao, Yuhua
    Zhao, Fukun
    JOURNAL OF THEORETICAL BIOLOGY, 2011, 276 (01) : 174 - 180
  • [5] A Measure of DNA Sequence Dissimilarity Based on Free Energy of Nearest-neighbor Interaction
    Zhang, Yusen
    Chen, Wei
    JOURNAL OF BIOMOLECULAR STRUCTURE & DYNAMICS, 2011, 28 (04) : 557 - 565
  • [6] Mahalanobis distance similarity measure based distinguisher for template attack
    Zhang, Hailong
    Zhou, Yongbin
    Feng, Dengguo
    SECURITY AND COMMUNICATION NETWORKS, 2015, 8 (05) : 769 - 777
  • [7] An Approach to Online Fuzzy Clustering Based on the Mahalanobis Distance Measure
    Hu, Zhengbing
    Tyshchenko, Oleksii K.
    ADVANCES IN INTELLIGENT SYSTEMS, COMPUTER SCIENCE AND DIGITAL ECONOMICS, 2020, 1127 : 364 - 374
  • [8] BHATTACHARYYA DISTANCE BASED EMOTIONAL DISSIMILARITY MEASURE FOR EMOTION CLASSIFICATION
    Tin Lay Nwe
    Nguyen Trung Hieu
    Limbu, Dilip Kumar
    2013 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2013 : 7512 - 7516
  • [9] Mahalanobis Distance Similarity Measure Based Higher Order Optimal Distinguisher
    Zhang, Hailong
    Zhou, Yongbin
    COMPUTER JOURNAL, 2017, 60 (08) : 1131 - 1144
  • [10] Morphological operators for color image processing based on Mahalanobis distance measure
    Al-Otum, HM
    OPTICAL ENGINEERING, 2003, 42 (09) : 2595 - 2606