Omics Data Preprocessing for Machine Learning: A Case Study in Childhood Obesity

Cited by: 5
Authors
Torres-Martos, Alvaro [1, 2, 3]
Bustos-Aibar, Mireia [1, 2, 3]
Ramirez-Mena, Alberto [4]
Camara-Sanchez, Sofia [5]
Anguita-Ruiz, Augusto [2, 3, 6, 7]
Alcala, Rafael [5]
Aguilera, Concepcion M. [1, 2, 3, 7]
Alcala-Fdez, Jesus [5]
Affiliations
[1] Univ Granada, Dept Biochem & Mol Biol 2, Granada 18071, Spain
[2] Univ Granada, Jose Mataix Verdu Inst Nutr & Food Technol INYTA, Ctr Biomed Res, Granada 18100, Spain
[3] Biosanitary Res Inst Granada IBS GRANADA, Granada 18012, Spain
[4] Ctr Genom & Oncol Res GENYO, Granada 18016, Spain
[5] Univ Granada, Andalusian Res Inst Data Sci & Computat Intelligen, Dept Comp Sci & Artificial Intelligence, Granada 18071, Spain
[6] Barcelona Inst Global Hlth ISGlobal, Barcelona 08003, Spain
[7] Inst Salud Carlos III, CIBER Physiopathol Obes & Nutr CIBEROBN, Madrid 28029, Spain
Keywords
machine learning; omics; data pre-processing; imputation; prediction; models
DOI
10.3390/genes14020248
Chinese Library Classification
Q3 [Genetics]
Subject classification codes
071007; 090102
Abstract
The use of machine learning techniques to build predictive models of disease outcomes (based on omics and other types of molecular data) has gained enormous relevance in the biomedical field in the last few years. Nonetheless, the value of omics studies and machine learning tools depends on the proper application of algorithms and on the appropriate pre-processing and management of the input omics and molecular data. Currently, many of the available approaches that apply machine learning to omics data for predictive purposes make mistakes in several of the following key steps: experimental design, feature selection, data pre-processing, and algorithm selection. For this reason, we propose the current work as a guideline on how to confront the main challenges inherent to multi-omics human data, and we present a series of best practices and recommendations for each of these steps. In particular, we describe the main particularities of each omics data layer, the most suitable pre-processing approaches for each source, and a compilation of best practices and tips for predicting disease development using machine learning. Using examples of real data, we show how to address the key problems in multi-omics research (e.g., biological heterogeneity, technical noise, high dimensionality, presence of missing values, and class imbalance). Finally, we define proposals for model improvement based on the results found, which serve as the basis for future work.
Pages: 16