Omics Data Preprocessing for Machine Learning: A Case Study in Childhood Obesity

Cited by: 5
Authors
Torres-Martos, Alvaro [1, 2, 3]
Bustos-Aibar, Mireia [1, 2, 3]
Ramirez-Mena, Alberto [4]
Camara-Sanchez, Sofia [5]
Anguita-Ruiz, Augusto [2, 3, 6, 7]
Alcala, Rafael [5]
Aguilera, Concepcion M. [1, 2, 3, 7]
Alcala-Fdez, Jesus [5]
Affiliations
[1] Univ Granada, Dept Biochem & Mol Biol 2, Granada 18071, Spain
[2] Univ Granada, Jose Mataix Verdu Inst Nutr & Food Technol INYTA, Ctr Biomed Res, Granada 18100, Spain
[3] Biosanitary Res Inst Granada IBS GRANADA, Granada 18012, Spain
[4] Ctr Genom & Oncol Res GENYO, Granada 18016, Spain
[5] Univ Granada, Andalusian Res Inst Data Sci & Computat Intelligen, Dept Comp Sci & Artificial Intelligence, Granada 18071, Spain
[6] Barcelona Inst Global Hlth ISGlobal, Barcelona 08003, Spain
[7] Inst Salud Carlos III, CIBER Physiopathol Obes & Nutr CIBEROBN, Madrid 28029, Spain
Keywords
machine learning; omics; data pre-processing; imputation; prediction; models
DOI
10.3390/genes14020248
Chinese Library Classification (CLC) number
Q3 [Genetics]
Discipline classification code
071007; 090102
Abstract
The use of machine learning techniques to build predictive models of disease outcomes from omics and other types of molecular data has gained enormous relevance in biomedicine in the last few years. Nonetheless, the value of omics studies and machine learning tools depends on the proper application of algorithms and on the appropriate pre-processing and management of the input omics and molecular data. Currently, many of the available approaches that apply machine learning to omics data for predictive purposes make mistakes in several of the following key steps: experimental design, feature selection, data pre-processing, and algorithm selection. For this reason, we propose the current work as a guideline on how to confront the main challenges inherent to multi-omics human data, and we present a series of best practices and recommendations for each of these steps. In particular, we describe the main particularities of each omics data layer, the most suitable pre-processing approaches for each source, and a compilation of best practices and tips for predicting disease development using machine learning. Using examples of real data, we show how to address the key problems in multi-omics research (e.g., biological heterogeneity, technical noise, high dimensionality, missing values, and class imbalance). Finally, we define proposals for model improvement based on the results found, which serve as the basis for future work.
Pages: 16