A measure of discrepancy of multiple sequences

Cited by: 27
Authors
Fang, WW
Roberts, FS
Ma, ZR
Affiliations
[1] Chinese Acad Sci, Inst Appl Math, Acad Math & Syst Sci, Beijing 100080, Peoples R China
[2] Rutgers State Univ, Ctr Discrete Math, Piscataway, NJ 08855 USA
[3] Rutgers State Univ, Theoret Comp Sci Ctr, DIMACS, Piscataway, NJ 08855 USA
[4] Rutgers State Univ, Waksman Inst Microbiol, Piscataway, NJ 08855 USA
Keywords
multiple sequence comparison; entropy; DNA; information discrepancy
DOI
10.1016/S0020-0255(01)00108-6
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Multiple sequence comparison is a basic problem for molecular biology and other sciences. In this paper, we introduce the concept of complete information set and some measurement principles for measuring discrepancy among multiple sequences. Based on them, we present a new measurement method satisfying the principles for comparing multiple sequences. We illustrate that this method can effectively distinguish different random sequences or DNA sequences of length 8000 by comparisons of 6-8 symbol (base) strings or protein sequences of length 8000 by comparisons of 3-4 symbol (amino acid) strings. It can also measure slight changes of a sequence, e.g., insertion or deletion of a symbol (a base or an amino acid) in a sequence. It is applied in the study of molecular evolution, and the elementary result shows a hierarchic relationship among the cytochrome C protein sequences of different species, much as that in taxonomy. (C) 2001 Elsevier Science Inc. All rights reserved.
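The abstract describes comparing sequences through the frequencies of short symbol strings and an entropy-based discrepancy. The following is a minimal sketch of that general idea, not the measure defined in the paper: it assumes overlapping k-symbol strings are counted per sequence and that the discrepancy is the sum of Kullback-Leibler divergences from each sequence's distribution to the mean distribution. The function names `kmer_distribution` and `discrepancy` are hypothetical.

```python
from collections import Counter
from math import log2


def kmer_distribution(seq, k):
    """Relative frequencies of all overlapping k-symbol strings in seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: c / total for kmer, c in counts.items()}


def discrepancy(seqs, k):
    """Assumed measure: sum over sequences of KL(distribution || mean distribution).

    This is an illustrative entropy-based discrepancy, not necessarily
    the one introduced in the paper.
    """
    dists = [kmer_distribution(s, k) for s in seqs]
    support = set().union(*dists)
    mean = {w: sum(d.get(w, 0.0) for d in dists) / len(dists) for w in support}
    total = 0.0
    for d in dists:
        for w, p in d.items():
            # p > 0 here, and mean[w] >= p / len(dists) > 0, so the log is defined.
            total += p * log2(p / mean[w])
    return total
```

Under these assumptions the value is 0 when all sequences share the same k-string distribution and grows as the distributions diverge; for example, `discrepancy(["ACGTACGT", "ACGTTACGT"], 2)` gives a small positive value for a single inserted base.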
Pages: 75-102
Page count: 28