Impact of Missing Data on Phylogenies Inferred from Empirical Phylogenomic Data Sets

Cited by: 258
Authors
Roure, Beatrice [1]
Baurain, Denis [2,3]
Philippe, Herve [1]
Affiliations
[1] Univ Montreal, Dept Biochim, Ctr Robert Cedergren, Montreal, PQ H3C 3J7, Canada
[2] Univ Liege, Unit Anim Genom, GIGA R, Liege, Belgium
[3] Univ Liege, Fac Vet Med, Liege, Belgium
Funding
Natural Sciences and Engineering Research Council of Canada (NSERC)
Keywords
phylogeny; supermatrix; supertree; taxon sampling; tree reconstruction artifact; model parameter estimation; LONG-BRANCH ATTRACTION; MAXIMUM-LIKELIHOOD; ANIMAL PHYLOGENY; BILATERIAN ANIMALS; EVOLUTIONARY TREES; MULTIGENE ANALYSES; INCOMPLETE TAXA; GENES SUPPORTS; MIXTURE MODEL; MIXED MODELS;
DOI
10.1093/molbev/mss208
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification codes
071010; 081704
Abstract
Progress in sequencing technology allows researchers to assemble ever-larger supermatrices for phylogenomic inference. However, current phylogenomic studies often rest on patchy data sets, with some having 80% missing (or ambiguous) data or more. Though early simulations had suggested that missing data per se do not harm phylogenetic inference when using sufficiently large data sets, Lemmon et al. (Lemmon AR, Brown JM, Stanger-Hall K, Lemmon EM. 2009. The effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and Bayesian inference. Syst Biol. 58:130-145.) have recently cast doubt on this consensus in a study based on the introduction of parsimony-uninformative incomplete characters. In this work, we empirically reassess the issue of missing data in phylogenomics while exploring possible interactions with the model of sequence evolution. First, we note that parsimony-uninformative incomplete characters are actually informative in a probabilistic framework. A reanalysis of Lemmon's data set with this in mind gives a very different interpretation of their results and shows that some of their conclusions may be unfounded. Second, we investigate the effect of the progressive introduction of missing data in a complete supermatrix (126 genes × 39 species) capable of resolving animal relationships. These analyses demonstrate that missing data perturb phylogenetic inference slightly beyond the expected decrease in resolving power. In particular, they exacerbate systematic errors by reducing the number of species effectively available for the detection of multiple substitutions. Consequently, large sparse supermatrices are more sensitive to phylogenetic artifacts than smaller but less incomplete data sets, which argues for experimental designs aimed at collecting a modest number (~50) of highly covered genes. Our results further confirm that including incomplete yet short-branch taxa (i.e., slowly evolving species or close outgroups) can help to eschew artifacts, as predicted by simulations. Finally, it appears that selecting an adequate model of sequence evolution (e.g., the site-heterogeneous CAT model instead of the site-homogeneous WAG model) is more beneficial to phylogenetic accuracy than reducing the level of missing data.
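The "progressive introduction of missing data in a complete supermatrix" mentioned in the abstract can be illustrated with a minimal sketch. It is not the procedure used in the paper: the masking pattern (whole gene-by-species cells chosen uniformly at random), the function names `mask_gene_blocks` and `missing_fraction`, and the toy 4-species by 3-gene matrix are all assumptions made for illustration.

```python
import random


def mask_gene_blocks(supermatrix, fraction, seed=0):
    """Return a copy of the supermatrix in which roughly `fraction` of the
    (species, gene) cells are replaced by '?' runs of the same length.

    Assumption: whole gene blocks are masked uniformly at random; the
    abstract does not specify how missing data were distributed.
    """
    rng = random.Random(seed)
    cells = [(sp, g) for sp, genes in supermatrix.items()
             for g in range(len(genes))]
    n_mask = round(fraction * len(cells))
    to_mask = set(rng.sample(cells, n_mask))
    return {
        sp: ["?" * len(seq) if (sp, g) in to_mask else seq
             for g, seq in enumerate(genes)]
        for sp, genes in supermatrix.items()
    }


def missing_fraction(supermatrix):
    """Fraction of all character positions that are '?'."""
    total = sum(len(seq) for genes in supermatrix.values() for seq in genes)
    missing = sum(seq.count("?")
                  for genes in supermatrix.values() for seq in genes)
    return missing / total


# Hypothetical complete toy supermatrix: 4 species x 3 genes, 4 sites each.
toy = {
    "sp_A": ["MKTL", "GAVL", "PQRS"],
    "sp_B": ["MKSL", "GAIL", "PQRT"],
    "sp_C": ["MRTL", "GAVM", "PKRS"],
    "sp_D": ["MKTV", "GSVL", "PQKS"],
}

# Progressively sparser versions of the same matrix.
for level in (0.0, 0.25, 0.5, 0.75):
    sparse = mask_gene_blocks(toy, level)
    print(level, missing_fraction(sparse))
```

Each sparser matrix would then be passed to a tree-inference program and its tree compared with the one obtained from the complete matrix; that step is outside the scope of this sketch.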
Pages: 197-214
Page count: 18