Omics Data Preprocessing for Machine Learning: A Case Study in Childhood Obesity

Cited by: 5
Authors
Torres-Martos, Alvaro [1, 2, 3]
Bustos-Aibar, Mireia [1, 2, 3]
Ramirez-Mena, Alberto [4]
Camara-Sanchez, Sofia [5]
Anguita-Ruiz, Augusto [2, 3, 6, 7]
Alcala, Rafael [5]
Aguilera, Concepcion M. [1, 2, 3, 7]
Alcala-Fdez, Jesus [5]
Affiliations
[1] Univ Granada, Dept Biochem & Mol Biol 2, Granada 18071, Spain
[2] Univ Granada, Jose Mataix Verdu Inst Nutr & Food Technol INYTA, Ctr Biomed Res, Granada 18100, Spain
[3] Biosanitary Res Inst Granada IBS GRANADA, Granada 18012, Spain
[4] Ctr Genom & Oncol Res GENYO, Granada 18016, Spain
[5] Univ Granada, Andalusian Res Inst Data Sci & Computat Intelligen, Dept Comp Sci & Artificial Intelligence, Granada 18071, Spain
[6] Barcelona Inst Global Hlth ISGlobal, Barcelona 08003, Spain
[7] Inst Salud Carlos III, CIBER Physiopathol Obes & Nutr CIBEROBN, Madrid 28029, Spain
Keywords
machine learning; omics; data pre-processing; IMPUTATION; PREDICTION; MODELS;
DOI
10.3390/genes14020248
Chinese Library Classification (CLC) number
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
The use of machine learning techniques to build predictive models of disease outcomes from omics and other types of molecular data has gained enormous relevance in the biomedical field in the last few years. Nonetheless, the value of omics studies and machine learning tools depends on the proper application of algorithms and on the appropriate pre-processing and management of the input omics and molecular data. Currently, many of the available approaches that use machine learning on omics data for predictive purposes make mistakes in several of the following key steps: experimental design, feature selection, data pre-processing, and algorithm selection. For this reason, we propose the current work as a guideline on how to confront the main challenges inherent to multi-omics human data, and we present a series of best practices and recommendations for each of the steps defined. In particular, we describe the main particularities of each omics data layer, the most suitable pre-processing approaches for each source, and a compilation of best practices and tips for studying disease development prediction with machine learning. Using examples of real data, we show how to address the key problems mentioned in multi-omics research (e.g., biological heterogeneity, technical noise, high dimensionality, presence of missing values, and class imbalance). Finally, we define proposals for model improvement based on the results found, which serve as the basis for future work.
Pages: 16
Related papers
50 in total
  • [31] On Evaluating Data Preprocessing Methods for Machine Learning Models for Flight Delays
    Moreira, Leonardo
    Dantas, Christofer
    Oliveira, Leonardo
    Soares, Jorge
    Ogasawara, Eduardo
    2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018, : 779 - 786
  • [32] Exploring House Price Forecasting through Machine Learning and Data Preprocessing
    Vaishnavi, A. V. S. S. P. L.
    Raghavendra, G. Gopi Krishna
    Jilan, Mohammed
    Chowdary, A. Pranya
    Singh, Rosen
    Karthikeyan, C.
    2024 4TH INTERNATIONAL CONFERENCE ON PERVASIVE COMPUTING AND SOCIAL NETWORKING, ICPCSN 2024, 2024, : 304 - 310
  • [33] Atlantic: Automated data preprocessing framework for supervised machine learning
    Santos, Luis
    Ferreira, Luis
    SOFTWARE IMPACTS, 2023, 17
  • [34] Overview of data preprocessing for machine learning applications in human microbiome research
    Ibrahimi, Eliana
    Lopes, Marta B.
    Dhamo, Xhilda
    Simeon, Andrea
    Shigdel, Rajesh
    Hron, Karel
    Stres, Blaz
    D'Elia, Domenica
    Berland, Magali
    Marcos-Zambrano, Laura Judith
    FRONTIERS IN MICROBIOLOGY, 2023, 14
  • [35] Iliou Machine Learning Data Preprocessing Method for Stress Level Prediction
    Iliou, Theodoros
    Konstantopoulou, Georgia
    Stephanakis, Ioannis
    Anastasopoulos, Konstantinos
    Lymberopoulos, Dimitrios
    Anastassopoulos, George
    ARTIFICIAL INTELLIGENCE APPLICATIONS AND INNOVATIONS, AIAI 2018, 2018, 519 : 351 - 361
  • [36] Diabetes Type 2: Poincare Data Preprocessing for Quantum Machine Learning
    Sierra-Sosa, Daniel
    Arcila-Moreno, Juan D.
    Garcia-Zapirain, Begonya
    Elmaghraby, Adel
    CMC-COMPUTERS MATERIALS & CONTINUA, 2021, 67 (02): : 1849 - 1861
  • [37] Prediction of Distillation Column Temperature Using Machine Learning and Data Preprocessing
    Lee, Yechan
    Choi, Yeongryeol
    Cho, Hyungtae
    Kim, Junghwan
    KOREAN CHEMICAL ENGINEERING RESEARCH, 2021, 59 (02): : 191 - 199
  • [38] Machine Learning Methods Based Preprocessing to Improve Categorical Data Classification
    Ruiz-Chavez, Zoila
    Salvador-Meneses, Jaime
    Garcia-Rodriguez, Jose
    INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2018, PT I, 2018, 11314 : 297 - 304
  • [39] Machine learning and deep learning methods that use omics data for metastasis prediction
    Albaradei, Somayah
    Thafar, Maha
    Alsaedi, Asim
    Van Neste, Christophe
    Gojobori, Takashi
    Essack, Magbubah
    Gao, Xin
    COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL, 2021, 19 : 5008 - 5018
  • [40] Effect of Data Preprocessing in the Detection of Epilepsy using Machine Learning Techniques
    Sabarivani, A.
    Ramadevi, R.
    Pandian, R.
    Krishnamoorthy, N. R.
    JOURNAL OF SCIENTIFIC & INDUSTRIAL RESEARCH, 2021, 80 (12): : 1066 - 1077