A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words

Cited by: 100
Authors
Wu, TJ [1]
Burke, JP
Davison, DB
Affiliations
[1] Natl Donghwa Univ, Dept Math Appl, Hualien, Taiwan
[2] Univ Houston, Dept Math, Houston, TX 77204 USA
[3] Univ Houston, Dept Biochem & Biophys Sci, Houston, TX 77204 USA
Keywords
DNA sequences; dissimilarity measures; Mahalanobis distance; standardized Euclidean distance
DOI
10.2307/2533509
Chinese Library Classification
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
A number of algorithms exist for searching genetic databases for biologically significant similarities in DNA sequences. Past research has shown that word-based search tools are computationally efficient and can find similarities or dissimilarities invisible to other algorithms like FASTA. We characterize a family of word-based dissimilarity measures that define distance between two sequences by simultaneously comparing the frequencies of all subsequences of n adjacent letters (i.e., n-words) in the two sequences. Applications to real data demonstrate that currently used word-based methods that rely on Euclidean distance can be significantly improved by using Mahalanobis distance, which accounts for both variances and covariances between frequencies of n-words. Furthermore, in those cases where Mahalanobis distance may be too difficult to compute, using standardized Euclidean distance, which only corrects for the variances of frequencies of n-words, still gives better performance than the Euclidean distance. Also, a simple way of combining distances obtained at different n-words is considered. The goal is to obtain a single measure of dissimilarity between two DNA sequences. The performance ranking of the preceding three distances still holds for their combined counterparts. All results obtained in this paper are applicable to amino acid sequences with minor modifications.
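The three distances compared in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the choice to return squared distances, and the `variances` and `precision` (inverse covariance) arguments are assumptions made for illustration; the abstract does not say how the variances and covariances of n-word frequencies are obtained, so they are taken here as inputs supplied by the caller.

```python
from itertools import product

ALPHABET = "ACGT"  # assumed DNA alphabet; other letters (e.g. N) are skipped


def word_frequencies(seq, n):
    """Relative frequencies of all overlapping n-words, in lexicographic order."""
    words = ["".join(p) for p in product(ALPHABET, repeat=n)]
    counts = dict.fromkeys(words, 0)
    total = len(seq) - n + 1  # number of overlapping windows of length n
    if total <= 0:
        raise ValueError("sequence is shorter than the word length")
    for i in range(total):
        word = seq[i:i + n]
        if word in counts:  # ignore windows containing non-ACGT letters
            counts[word] += 1
    return [counts[w] / total for w in words]


def euclidean_sq(f, g):
    """Squared Euclidean distance between two frequency vectors."""
    return sum((a - b) ** 2 for a, b in zip(f, g))


def standardized_euclidean_sq(f, g, variances):
    """Squared differences, each scaled by the variance of that word's frequency."""
    return sum((a - b) ** 2 / v for a, b, v in zip(f, g, variances))


def mahalanobis_sq(f, g, precision):
    """Quadratic form d^T P d, where P is a caller-supplied inverse covariance matrix."""
    d = [a - b for a, b in zip(f, g)]
    k = len(d)
    return sum(d[i] * precision[i][j] * d[j] for i in range(k) for j in range(k))


if __name__ == "__main__":
    f = word_frequencies("ACGTACGT", 2)
    g = word_frequencies("AACCGGTT", 2)
    print(euclidean_sq(f, g))
    # Placeholder variances of 1.0, for demonstration only.
    print(standardized_euclidean_sq(f, g, [1.0] * len(f)))
```

With an identity `precision` matrix the Mahalanobis form reduces to the Euclidean one, and with a diagonal `precision` matrix it reduces to the standardized Euclidean one, which mirrors the ordering of the three measures described in the abstract.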
Pages: 1431-1439
Page count: 9