Impact of Missing Data on Phylogenies Inferred from Empirical Phylogenomic Data Sets

Cited by: 258
Authors
Roure, Beatrice [1]
Baurain, Denis [2, 3]
Philippe, Herve [1]
Affiliations
[1] Univ Montreal, Dept Biochim, Ctr Robert Cedergren, Montreal, PQ H3C 3J7, Canada
[2] Univ Liege, Unit Anim Genom, GIGA R, Liege, Belgium
[3] Univ Liege, Fac Vet Med, Liege, Belgium
Funding
Natural Sciences and Engineering Research Council of Canada;
Keywords
phylogeny; supermatrix; supertree; taxon sampling; tree reconstruction artifact; model parameter estimation; LONG-BRANCH ATTRACTION; MAXIMUM-LIKELIHOOD; ANIMAL PHYLOGENY; BILATERIAN ANIMALS; EVOLUTIONARY TREES; MULTIGENE ANALYSES; INCOMPLETE TAXA; GENES SUPPORTS; MIXTURE MODEL; MIXED MODELS;
DOI
10.1093/molbev/mss208
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification codes
071010 ; 081704 ;
Abstract
Progress in sequencing technology allows researchers to assemble ever-larger supermatrices for phylogenomic inference. However, current phylogenomic studies often rest on patchy data sets, with some having 80% missing (or ambiguous) data or more. Though early simulations had suggested that missing data per se do not harm phylogenetic inference when using sufficiently large data sets, Lemmon et al. (Lemmon AR, Brown JM, Stanger-Hall K, Lemmon EM. 2009. The effect of ambiguous data on phylogenetic estimates obtained by maximum likelihood and Bayesian inference. Syst Biol. 58:130-145.) have recently cast doubt on this consensus in a study based on the introduction of parsimony-uninformative incomplete characters. In this work, we empirically reassess the issue of missing data in phylogenomics while exploring possible interactions with the model of sequence evolution. First, we note that parsimony-uninformative incomplete characters are actually informative in a probabilistic framework. A reanalysis of Lemmon's data set with this in mind gives a very different interpretation of their results and shows that some of their conclusions may be unfounded. Second, we investigate the effect of the progressive introduction of missing data in a complete supermatrix (126 genes × 39 species) capable of resolving animal relationships. These analyses demonstrate that missing data perturb phylogenetic inference slightly beyond the expected decrease in resolving power. In particular, they exacerbate systematic errors by reducing the number of species effectively available for the detection of multiple substitutions. Consequently, large sparse supermatrices are more sensitive to phylogenetic artifacts than smaller but less incomplete data sets, which argues for experimental designs aimed at collecting a modest number (~50) of highly covered genes. Our results further confirm that including incomplete yet short-branch taxa (i.e., slowly evolving species or close outgroups) can help to eschew artifacts, as predicted by simulations. Finally, it appears that selecting an adequate model of sequence evolution (e.g., the site-heterogeneous CAT model instead of the site-homogeneous WAG model) is more beneficial to phylogenetic accuracy than reducing the level of missing data.
Pages: 197 - 214
Number of pages: 18
Related papers
50 in total
  • [41] Missing Data in Surgical Data Sets: A Review of Pertinent Issues and Solutions
    Sharath, Sherene E.
    Zamani, Nader
    Kougias, Panos
    Kim, Soeun
    JOURNAL OF SURGICAL RESEARCH, 2018, 232 : 240 - 246
  • [42] Missing entry replacement data analysis: A replacement approach to dealing with missing data in paleontological and total evidence data sets
    Norell, MA
    Wheeler, W
    JOURNAL OF VERTEBRATE PALEONTOLOGY, 2003, 23 (02) : 275 - 283
  • [43] INTEGRATION OF MORPHOLOGICAL AND MOLECULAR-DATA SETS IN ESTIMATING FUNGAL PHYLOGENIES
    LUTZONI, F
    VILGALYS, R
    CANADIAN JOURNAL OF BOTANY-REVUE CANADIENNE DE BOTANIQUE, 1995, 73 : S649 - S659
  • [44] Comparison of phylogenies derived from two molecular data sets in the avian genera Pipilo and Spizella
    Dodge, AG
    Fry, AJ
    Blackwell, RC
    Zink, RM
    WILSON BULLETIN, 1995, 107 (04) : 641 - 654
  • [45] Missing data matter: an empirical evaluation of the impacts of missing EHR data in comparative effectiveness research
    Zhou, Yizhao
    Shi, Jiasheng
    Stein, Ronen
    Liu, Xiaokang
    Baldassano, Robert N.
    Forrest, Christopher B.
    Chen, Yong
    Huang, Jing
    JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2023, 30 (07) : 1246 - 1256
  • [46] Delimiting Coalescence Genes (C-Genes) in Phylogenomic Data Sets
    Springer, Mark S.
    Gatesy, John
    GENES, 2018, 9 (03):
  • [47] Phylogenomic evidence of bryophytes' monophyly using complete and incomplete data sets from chloroplast proteomes
    Shanker, Asheesh
    Sharma, Vinay
    Daniell, Henry
    JOURNAL OF PLANT BIOCHEMISTRY AND BIOTECHNOLOGY, 2011, 20 (02) : 288 - 292
  • [48] Effects of visualizing missing data: an empirical evaluation
    Andreasson, Rebecca
    Riveiro, Maria
    2014 18TH INTERNATIONAL CONFERENCE ON INFORMATION VISUALISATION (IV), 2014 : 132 - 138
  • [49] Phylogenomic evidence of bryophytes’ monophyly using complete and incomplete data sets from chloroplast proteomes
    Asheesh Shanker
    Vinay Sharma
    Henry Daniell
    Journal of Plant Biochemistry and Biotechnology, 2011, 20 : 288 - 292
  • [50] ComPhy: prokaryotic composite distance phylogenies inferred from whole-genome gene sets
    Lin, Guan Ning
    Cai, Zhipeng
    Lin, Guohui
    Chakraborty, Sounak
    Xu, Dong
    BMC BIOINFORMATICS, 2009, 10