Interpreting tree ensemble machine learning models with endoR

Cited by: 2
Authors
Ruaud, Albane [1]
Pfister, Niklas [2]
Ley, Ruth E. [1]
Youngblut, Nicholas D. [1]
Affiliations
[1] Max Planck Inst Dev Biol, Dept Microbiome Sci, Tubingen, Germany
[2] Univ Copenhagen, Dept Math Sci, Copenhagen, Denmark
Keywords
HUMAN GUT MICROBIOME; METHANOBREVIBACTER-SMITHII; COLONIC TRANSIT; OBESITY; SELECTION; FERMENTATION; SIGNATURE; BACTERIA; ARCHAEA; METHANE;
DOI
10.1371/journal.pcbi.1010714
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Tree ensemble machine learning models are increasingly used in microbiome science as they are compatible with the compositional, high-dimensional, and sparse structure of sequence-based microbiome data. While such models are often good at predicting phenotypes based on microbiome data, they only yield limited insights into how microbial taxa may be associated. We developed endoR, a method to interpret tree ensemble models. First, endoR simplifies the fitted model into a decision ensemble. Then, it extracts information on the importance of individual features and their pairwise interactions, displaying them as an interpretable network. Both the endoR network and importance scores provide insights into how features, and interactions between them, contribute to the predictive performance of the fitted model. Adjustable regularization and bootstrapping help reduce the complexity and ensure that only essential parts of the model are retained. We assessed endoR on both simulated and real metagenomic data. We found endoR to have comparable accuracy to other common approaches while easing and enhancing model interpretation. Using endoR, we also confirmed published results on gut microbiome differences between cirrhotic and healthy individuals. Finally, we utilized endoR to explore associations between human gut methanogens and microbiome components. Indeed, these hydrogen consumers are expected to interact with fermenting bacteria in a complex syntrophic network. Specifically, we analyzed a global metagenome dataset of 2203 individuals and confirmed the previously reported association between Methanobacteriaceae and Christensenellales. Additionally, we observed that Methanobacteriaceae are associated with a network of hydrogen-producing bacteria. Our method accurately captures how tree ensembles use features and interactions between them to predict a response. 
As demonstrated by our applications, the resultant visualizations and summary outputs facilitate model interpretation and enable the generation of novel hypotheses about complex systems.
Author summary
Machine learning models have proven to be successful at predicting diseases and other human phenotypes from microbiome data; however, gaining insight from such complex models is often challenging. To this end, we developed endoR, an R-package for enhanced interpretation of tree ensemble models (e.g., random forests), the most popular and highest-performing machine learning models applied to microbiome data to date. Our method simplifies models and extracts information on associations between microbiome data, host metadata and covariates, and a predicted trait (e.g., disease versus healthy). endoR has two main strengths: i) the ability to capture interactions between predictors, and ii) regularization steps that avoid overfitting. Through extensive validations, we show that endoR is comparable in accuracy to other common approaches while easing and enhancing model interpretation. We applied endoR to gain insight into a complex syntrophic network of human gut methanogens and bacterial fermenters. Overall, endoR is a powerful tool for gaining insight from tree ensemble models applied to microbiome data.
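The general idea of turning a tree ensemble into a decision ensemble and then summarizing features and feature pairs can be illustrated with a toy sketch. This is not endoR's implementation or API: the nested-dict tree format, the feature names `taxon_A` and `taxon_B`, the thresholds, and the helper functions are all hypothetical, and plain rule counts stand in for the importance and interaction scores the abstract mentions.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy ensemble (assumption, not real data): internal nodes test
# `feature <= threshold`; leaves carry a predicted value.
TREES = [
    {"feature": "taxon_A", "threshold": 0.1,
     "left": {"value": 0.0},
     "right": {"feature": "taxon_B", "threshold": 0.2,
               "left": {"value": 0.3},
               "right": {"value": 0.9}}},
    {"feature": "taxon_B", "threshold": 0.2,
     "left": {"value": 0.1},
     "right": {"value": 0.8}},
]


def extract_rules(node, conditions=()):
    """Return one (conditions, prediction) rule per root-to-leaf path."""
    if "value" in node:
        return [(conditions, node["value"])]
    feature, threshold = node["feature"], node["threshold"]
    left = extract_rules(node["left"], conditions + ((feature, "<=", threshold),))
    right = extract_rules(node["right"], conditions + ((feature, ">", threshold),))
    return left + right


def summarize(trees):
    """Flatten all trees into rules, then count features and feature pairs.

    Counting how often a feature (or pair) appears in rules is a crude
    stand-in for importance and interaction scores, used here for illustration.
    """
    rules = [rule for tree in trees for rule in extract_rules(tree)]
    feature_counts = Counter()
    pair_counts = Counter()
    for conditions, _prediction in rules:
        features = sorted({feature for feature, _op, _thr in conditions})
        feature_counts.update(features)
        pair_counts.update(combinations(features, 2))
    return rules, feature_counts, pair_counts


rules, feature_counts, pair_counts = summarize(TREES)
for conditions, prediction in rules:
    print(conditions, "->", prediction)
print(dict(feature_counts))
print(dict(pair_counts))
```

The pair counts are what a network view would draw as edges between features; a real analysis would weight rules by support and prediction error rather than counting them equally.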
Pages: 39