SEQUENCE ORDINATIONS - A MULTIVARIATE-ANALYSIS APPROACH TO ANALYZING LARGE SEQUENCE DATA SETS

Author
HIGGINS, DG
DOI
Not available
Chinese Library Classification
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
Ordination is a powerful method for analysing complex data sets but has been largely ignored in sequence analysis. This paper shows how to use principal coordinates analysis to find low-dimensional representations of distance matrices derived from aligned sets of sequences. The method takes a matrix of Euclidean distances between all pairs of sequences and finds a coordinate space in which those distances are exactly preserved. The main problem is to find a measure of distance between aligned sequences that is Euclidean. The simplest distance function is the square root of the percentage difference (as measured by identities) between two sequences, ignoring any alignment position where any sequence has a gap. If positions with gaps are not ignored, the distances cannot be guaranteed to be Euclidean, but the deleterious effects are trivial. Two examples of the method are shown. In the first, a set of 226 aligned globins was analysed, and the resulting ordination very successfully represents the known patterns of relationship between the sequences. In the second, a set of 610 aligned 5S rRNA sequences was analysed. Sequence ordinations complement phylogenetic analyses; they should not be viewed as a complete alternative.
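The procedure described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the author's implementation: the function names `sqrt_percent_distances` and `principal_coordinates`, the toy alignment, and the `-` gap character are all hypothetical, and the ordination step is the standard classical principal coordinates construction (double-centring the squared distances and taking an eigendecomposition).

```python
import numpy as np

def sqrt_percent_distances(seqs, gap="-"):
    """Square root of the percentage difference between aligned sequences.

    Columns with a gap in ANY sequence are dropped first, as the abstract
    suggests, so that the resulting distances are Euclidean.
    Assumes all sequences have equal length and at least one gap-free column.
    """
    aln = np.array([list(s) for s in seqs])
    keep = ~(aln == gap).any(axis=0)          # gap-free columns only
    aln = aln[:, keep]
    # Fraction of differing positions for every pair (n x n matrix).
    frac_diff = (aln[:, None, :] != aln[None, :, :]).mean(axis=2)
    return np.sqrt(100.0 * frac_diff)

def principal_coordinates(dist, tol=1e-9):
    """Classical principal coordinates analysis of a distance matrix."""
    n = dist.shape[0]
    centring = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centring @ (dist ** 2) @ centring   # double-centred matrix
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1]                 # largest axis first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    positive = eigvals > tol                          # keep real axes only
    coords = eigvecs[:, positive] * np.sqrt(eigvals[positive])
    return coords, eigvals

# Hypothetical toy alignment; column 9 contains a gap and is ignored.
toy_alignment = ["ACGTACGT-A", "ACGTTCGTAA", "ACCTTCGTAA", "TCCTTCGAAA"]
dist = sqrt_percent_distances(toy_alignment)
coords, eigvals = principal_coordinates(dist)

# Distances recomputed from the coordinates, for comparison with `dist`.
recon = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
```

Because this distance is Euclidean, the double-centred matrix has no negative eigenvalues beyond rounding error, and `recon` matches `dist` when all positive axes are kept. Plotting only the first two or three columns of `coords` gives the low-dimensional ordination. The broadcasting step builds an n x n x L boolean array, which is fine for a sketch but may need chunking for large alignments.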
Pages: 15-22
Page count: 8