The Multispecies Coalescent Model Outperforms Concatenation Across Diverse Phylogenomic Data Sets

Cited by: 46
Authors
Jiang, Xiaodong [1]
Edwards, Scott V. [2, 3]
Liu, Liang [1, 4]
Affiliations
[1] Univ Georgia, Dept Stat, 310 Herty Dr, Athens, GA 30602 USA
[2] Harvard, Dept Organism & Evolutionary Biol, 26 Oxford St, Cambridge, MA 02138 USA
[3] Harvard, Museum Comparat Zool, 26 Oxford St, Cambridge, MA 02138 USA
[4] Univ Georgia, Inst Bioinformat, 120 Green St, Athens, GA 30602 USA
Funding
National Science Foundation (USA); National Institutes of Health (USA)
Keywords
Bayes factor; Bayesian model validation; coalescent prior; congruent gene trees; independent prior; Metazoa; posterior predictive simulation; ESTIMATING SPECIES TREES; GENE TREES; SEQUENCE DATA; DNA-SEQUENCES; INFERENCE; MITOCHONDRIAL; PERFORMANCE; EVOLUTION; SISTER; SITES
DOI
10.1093/sysbio/syaa008
Chinese Library Classification number
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09
Abstract
A statistical framework for model comparison and model validation is essential to resolving the debate over concatenation and coalescent models in phylogenomic data analysis. Here, a set of statistical tests is developed and applied to evaluate and compare the adequacy of substitution, concatenation, and multispecies coalescent (MSC) models across 47 phylogenomic data sets collected across the tree of life. Tests of substitution models and of the concatenation assumption of topologically congruent gene trees suggest that poor fit is widespread: substitution models are rejected by 44% of loci and concatenation models by 38% of loci. Logistic regression shows that the proportions of GC content and informative sites are both negatively correlated with the fit of substitution models across loci. Moreover, a substantial violation of the concatenation assumption of congruent gene trees is consistently observed across six major groups (birds, mammals, fish, insects, reptiles, and others, including other invertebrates). In contrast, among those loci adequately described by a given substitution model, the proportion of loci rejecting the MSC model is 11%, significantly lower than the proportions rejecting the substitution and concatenation models. Although conducted on reduced data sets due to computational constraints, Bayesian model validation and comparison both strongly favor the MSC over concatenation across all data sets; the concatenation assumption of congruent gene trees rarely holds for phylogenomic data sets with more than 10 loci. Thus, for large phylogenomic data sets, model comparisons are expected to consistently and more strongly favor the coalescent model over the concatenation model. We also found that loci rejecting the MSC have little effect on species tree estimation. Our study reveals the value of model validation and comparison in phylogenomic data analysis, as well as the need for further improvements of multilocus models and computational tools for phylogenetic inference.
Pages: 795-812
Number of pages: 18