Microbiome Preprocessing Machine Learning Pipeline

Cited by: 11
Authors
Jasner, Yoel Y. [1]
Belogolovski, Anna [1]
Ben-Itzhak, Meirav [1]
Koren, Omry [2]
Louzoun, Yoram [1]
Affiliations
[1] Bar Ilan Univ, Dept Math, Ramat Gan, Israel
[2] Bar Ilan Univ, Azrieli Fac Med, Ramat Gan, Israel
Source
FRONTIERS IN IMMUNOLOGY | 2021 / Volume 12
Keywords
pipeline; machine learning; 16S; OTU; ASV; feature selection
DOI
10.3389/fimmu.2021.677870
Chinese Library Classification (CLC) number
R392 [Medical immunology]; Q939.91 [Immunology]
Discipline classification code
100102
Abstract
Background: 16S sequencing results are often used for Machine Learning (ML) tasks. 16S gene sequences are represented as feature counts, which are associated with a taxonomic representation. Raw feature counts may not be the optimal representation for ML. Methods: We checked multiple preprocessing steps and tested the optimal combination for 16S sequencing-based classification tasks. We computed the contribution of each step to the accuracy, as measured by the Area Under Curve (AUC) of the classification. Results: We show that the log of the feature counts is much more informative than the relative counts. We further show that merging features associated with the same taxonomy at a given level, through a dimension reduction step for each group of bacteria, improves the AUC. Finally, we show that z-scoring has a very limited effect on the results. Conclusions: The preprocessing of microbiome 16S data is crucial for optimal microbiome-based Machine Learning. These preprocessing steps are integrated into the MIPMLP - Microbiome Preprocessing Machine Learning Pipeline, which is available as a stand-alone version at https://github.com/louzounlab/microbiome/tree/master/Preprocess or as a service at http://mip-mlp.math.biu.ac.il/Home. Both contain the code and standard test sets.
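The three steps named in the abstract (log transform of the counts, merging features that share a taxonomy through a per-group dimension reduction, and optional z-scoring) can be illustrated with a minimal NumPy sketch. This is not the MIPMLP code or its API: the function names, the pseudocount of 0.1, the use of the first principal component as the per-group reduction, and the toy count table are all assumptions made for illustration.

```python
import numpy as np


def log_transform(counts, pseudocount=0.1):
    """Log10 of raw counts; the pseudocount (an assumed value) avoids log(0)."""
    return np.log10(np.asarray(counts, dtype=float) + pseudocount)


def merge_by_taxon(features, taxa):
    """Merge columns that share a taxonomic label into one column per label.

    A group with a single column is kept unchanged. A larger group is
    projected onto its first principal component, which is one possible
    choice of dimension reduction (an assumption, not the paper's stated one).
    """
    taxa = np.asarray(taxa)
    merged, names = [], []
    for taxon in sorted(set(taxa)):
        block = features[:, taxa == taxon]
        if block.shape[1] == 1:
            merged.append(block[:, 0])
        else:
            centered = block - block.mean(axis=0)
            # Rows of vt are the principal axes of the centered block.
            _, _, vt = np.linalg.svd(centered, full_matrices=False)
            merged.append(centered @ vt[0])
        names.append(str(taxon))
    return np.column_stack(merged), names


def z_score(features):
    """Standardize each column; constant columns end up as all zeros."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    std[std == 0] = 1.0
    return (features - mean) / std


def preprocess(counts, taxa, pseudocount=0.1, standardize=True):
    """Apply log transform, taxonomy-level merge, then optional z-scoring."""
    logged = log_transform(counts, pseudocount)
    merged, names = merge_by_taxon(logged, taxa)
    if standardize:
        merged = z_score(merged)
    return merged, names


if __name__ == "__main__":
    # Hypothetical table: 4 samples (rows) x 4 features (columns).
    counts = np.array([
        [10, 0, 5, 200],
        [3, 7, 0, 150],
        [0, 12, 9, 90],
        [25, 1, 4, 300],
    ])
    # Hypothetical genus-level label for each feature column.
    taxa = ["g__Bacteroides", "g__Bacteroides", "g__Prevotella", "g__Prevotella"]
    X, names = preprocess(counts, taxa)
    print(names)
    print(X.shape)
```

Since the abstract reports that z-scoring has only a very limited effect, the sketch exposes it as a `standardize` switch so it can be turned off when comparing classifiers.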
Pages: 11