Machine Learning Meta-analysis of Large Metagenomic Datasets: Tools and Biological Insights

Cited by: 345
Authors
Pasolli, Edoardo [1]
Truong, Duy Tin [1]
Malik, Faizan [2]
Waldron, Levi [2]
Segata, Nicola [1]
Affiliations
[1] Univ Trento, Ctr Integrat Biol, Trento, Italy
[2] CUNY, Grad Sch Publ Hlth & Hlth Policy, New York, NY 10021 USA
Funding
National Science Foundation (USA); National Institutes of Health (USA)
Keywords
MULTICATEGORY CLASSIFICATION METHODS; HUMAN GUT MICROBIOME; COMPREHENSIVE EVALUATION; FECAL MICROBIOTA; GENE-EXPRESSION; VALIDATION; PREDICTION; REGRESSION; SELECTION;
DOI
10.1371/journal.pcbi.1004977
Chinese Library Classification
Q5 [Biochemistry]
Subject classification codes
071010 ; 081704 ;
Abstract
Shotgun metagenomic analysis of the human-associated microbiome provides a rich set of microbial features for prediction and biomarker discovery in the context of human diseases and health conditions. However, the use of such high-resolution microbial features presents new challenges, and validated computational tools for learning tasks are lacking. Moreover, classification rules have scarcely been validated in independent studies, raising questions about how well disease-predictive models generalize across cohorts. In this paper, we comprehensively assess approaches to metagenomics-based prediction tasks and to the quantitative assessment of the strength of potential microbiome-phenotype associations. We develop a computational framework for prediction tasks using quantitative microbiome profiles, including species-level relative abundances and presence of strain-specific markers. A comprehensive meta-analysis, with particular emphasis on generalization across cohorts, was performed on a collection of 2424 publicly available metagenomic samples from eight large-scale studies. Cross-validation revealed good disease-prediction capabilities, which were in general improved by feature selection and by the use of strain-specific markers instead of species-level taxonomic abundances. In cross-study analysis, models transferred between studies were in some cases less accurate than models tested by within-study cross-validation. Interestingly, the addition of healthy (control) samples from other studies to training sets improved disease-prediction capabilities. Some microbial species (most notably Streptococcus anginosus) seem to characterize general dysbiotic states of the microbiome rather than connections with a specific disease. Our results in modelling features of the "healthy" microbiome can be considered a first step toward defining general microbial dysbiosis.
The software framework, microbiome profiles, and metadata for thousands of samples are publicly available at http://segatalab.cibio.unitn.it/tools/metaml.
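As a rough illustration of the two evaluation settings the abstract contrasts (within-study cross-validation versus applying a model to a study it was not trained on), the sketch below uses only the Python standard library. It is not the authors' framework: the nearest-centroid classifier, the synthetic two-feature "studies", and all function names (`make_study`, `within_study_cv`, `leave_one_study_out`) are assumptions made purely for illustration.

```python
import random
from statistics import mean


def make_study(seed, n=40, shift=0.0):
    """Build a synthetic 'study' (assumption: not real microbiome data).

    Each sample is (features, label), with label 1 = disease, 0 = control.
    Feature 0 separates the classes; feature 1 carries a study-specific shift.
    """
    rng = random.Random(seed)
    samples = []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(1.0 if label else -1.0, 1.0), rng.gauss(shift, 1.0)]
        samples.append((x, label))
    return samples


def fit_centroids(train):
    """Nearest-centroid classifier: store one mean feature vector per class."""
    centroids = {}
    for label in {y for _, y in train}:
        rows = [x for x, y in train if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    def sqdist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: sqdist(centroids[label]))


def accuracy(centroids, test):
    """Fraction of test samples classified correctly."""
    return mean(1.0 if predict(centroids, x) == y else 0.0 for x, y in test)


def within_study_cv(samples, k=5):
    """k-fold cross-validation inside a single study."""
    folds = [samples[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        scores.append(accuracy(fit_centroids(train), folds[i]))
    return mean(scores)


def leave_one_study_out(studies):
    """Train on all studies but one, then test on the held-out study."""
    results = {}
    for held_out in studies:
        train = [s for name, data in studies.items()
                 if name != held_out for s in data]
        results[held_out] = accuracy(fit_centroids(train), studies[held_out])
    return results


# Three hypothetical studies with different batch shifts on feature 1.
studies = {
    "study_A": make_study(seed=1, shift=0.0),
    "study_B": make_study(seed=2, shift=1.5),
    "study_C": make_study(seed=3, shift=-1.5),
}

for name, data in studies.items():
    print(name, "within-study CV accuracy:", round(within_study_cv(data), 3))
print("leave-one-study-out accuracy:", leave_one_study_out(studies))
```

Comparing the two sets of numbers for each study mirrors, in miniature, the comparison described in the abstract; with real data, the training pool could also be extended with control samples from other studies to examine the effect the abstract reports.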
Pages: 26