Omics Data Preprocessing for Machine Learning: A Case Study in Childhood Obesity

Cited by: 5

Authors:

Torres-Martos, Alvaro ^{[1,2,3]}

Bustos-Aibar, Mireia ^{[1,2,3]}

Ramirez-Mena, Alberto ^{[4]}

Camara-Sanchez, Sofia ^{[5]}

Anguita-Ruiz, Augusto ^{[2,3,6,7]}

Alcala, Rafael ^{[5]}

Aguilera, Concepcion M. ^{[1,2,3,7]}

Alcala-Fdez, Jesus ^{[5]}

Affiliations:

[1] Univ Granada, Dept Biochem & Mol Biol 2, Granada 18071, Spain

[2] Univ Granada, Jose Mataix Verdu Inst Nutr & Food Technol INYTA, Ctr Biomed Res, Granada 18100, Spain

[3] Biosanitary Res Inst Granada IBS GRANADA, Granada 18012, Spain

[4] Ctr Genom & Oncol Res GENYO, Granada 18016, Spain

[5] Univ Granada, Andalusian Res Inst Data Sci & Computat Intelligen, Dept Comp Sci & Artificial Intelligence, Granada 18071, Spain

[6] Barcelona Inst Global Hlth ISGlobal, Barcelona 08003, Spain

[7] Inst Salud Carlos III, CIBER Physiopathol Obes & Nutr CIBEROBN, Madrid 28029, Spain

Source:

GENES | 2023 / Volume 14 / Issue 02

Keywords:

machine learning; omics; data pre-processing; IMPUTATION; PREDICTION; MODELS;

DOI:

10.3390/genes14020248

Chinese Library Classification number:

Q3 [Genetics];

Discipline classification code:

071007 ; 090102 ;

Abstract:

The use of machine learning techniques for the construction of predictive models of disease outcomes (based on omics and other types of molecular data) has gained enormous relevance in the last few years in the biomedical field. Nonetheless, the usefulness of omics studies and machine learning tools depends on the proper application of algorithms as well as the appropriate pre-processing and management of input omics and molecular data. Currently, many of the available approaches that use machine learning on omics data for predictive purposes make mistakes in several of the following key steps: experimental design, feature selection, data pre-processing, and algorithm selection. For this reason, we propose the current work as a guideline on how to confront the main challenges inherent to multi-omics human data. As such, a series of best practices and recommendations is also presented for each of the steps defined. In particular, the main particularities of each omics data layer, the most suitable preprocessing approaches for each source, and a compilation of best practices and tips for the study of disease development prediction using machine learning are described. Using examples of real data, we show how to address the key problems mentioned in multi-omics research (e.g., biological heterogeneity, technical noise, high dimensionality, presence of missing values, and class imbalance). Finally, we define proposals for model improvement based on the results found, which serve as the basis for future work.

Pages: 16

