A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words

Cited by: 100
Authors
Wu, TJ [1]
Burke, JP
Davison, DB
Affiliations
[1] Natl Donghwa Univ, Dept Math Appl, Hualien, Taiwan
[2] Univ Houston, Dept Math, Houston, TX 77204 USA
[3] Univ Houston, Dept Biochem & Biophys Sci, Houston, TX 77204 USA
Keywords
DNA sequences; dissimilarity measures; Mahalanobis distance; standardized Euclidean distance
DOI
10.2307/2533509
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
A number of algorithms exist for searching genetic databases for biologically significant similarities in DNA sequences. Past research has shown that word-based search tools are computationally efficient and can find similarities or dissimilarities invisible to other algorithms like FASTA. We characterize a family of word-based dissimilarity measures that define distance between two sequences by simultaneously comparing the frequencies of all subsequences of n adjacent letters (i.e., n-words) in the two sequences. Applications to real data demonstrate that currently used word-based methods that rely on Euclidean distance can be significantly improved by using Mahalanobis distance, which accounts for both variances and covariances between frequencies of n-words. Furthermore, in those cases where Mahalanobis distance may be too difficult to compute, using standardized Euclidean distance, which only corrects for the variances of frequencies of n-words, still gives better performance than the Euclidean distance. Also, a simple way of combining distances obtained at different n-words is considered. The goal is to obtain a single measure of dissimilarity between two DNA sequences. The performance ranking of the preceding three distances still holds for their combined counterparts. All results obtained in this paper are applicable to amino acid sequences with minor modifications.
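The three distances compared in the abstract can be illustrated with a minimal sketch. The abstract does not say how the variances and covariances of n-word frequencies are obtained, so here they are simply passed in by the caller. The function names, the use of relative (rather than raw) frequencies, the square-root form of each distance, and the choice to pass the inverse covariance matrix directly are all assumptions made for illustration, not the paper's definitions.

```python
from collections import Counter
from itertools import product
from math import sqrt

ALPHABET = "ACGT"  # assumed DNA alphabet; other characters are not given a coordinate


def word_frequencies(seq, n):
    """Relative frequencies of all overlapping n-words, in lexicographic order."""
    words = ["".join(p) for p in product(ALPHABET, repeat=n)]
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = max(len(seq) - n + 1, 1)  # number of overlapping windows
    return [counts[w] / total for w in words]


def euclidean(f, g):
    """Plain Euclidean distance between two frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(f, g)))


def standardized_euclidean(f, g, variances):
    """Euclidean distance with each coordinate scaled by a supplied variance."""
    return sqrt(sum((a - b) ** 2 / v for a, b, v in zip(f, g, variances)))


def mahalanobis(f, g, precision):
    """Quadratic-form distance; `precision` is the inverse covariance matrix,
    supplied by the caller (hypothetical input, not estimated here)."""
    d = [a - b for a, b in zip(f, g)]
    k = len(d)
    q = sum(d[i] * precision[i][j] * d[j] for i in range(k) for j in range(k))
    return sqrt(q)


# Usage: compare two short sequences at n = 2.
f = word_frequencies("ACGTACGT", 2)
g = word_frequencies("AAAAAAAA", 2)
identity = [[1.0 if i == j else 0.0 for j in range(16)] for i in range(16)]
d_euclid = euclidean(f, g)
d_std = standardized_euclidean(f, g, [1.0] * 16)
d_mahal = mahalanobis(f, g, identity)
```

With unit variances and an identity inverse covariance, both corrected distances reduce to the Euclidean distance, which shows that the three measures differ only in how the frequency differences are weighted.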
Pages: 1431-1439
Page count: 9