Omics Data Preprocessing for Machine Learning: A Case Study in Childhood Obesity

Times cited: 5
Authors
Torres-Martos, Alvaro [1, 2, 3]
Bustos-Aibar, Mireia [1, 2, 3]
Ramirez-Mena, Alberto [4]
Camara-Sanchez, Sofia [5]
Anguita-Ruiz, Augusto [2, 3, 6, 7]
Alcala, Rafael [5]
Aguilera, Concepcion M. [1, 2, 3, 7]
Alcala-Fdez, Jesus [5]
Affiliations
[1] Univ Granada, Dept Biochem & Mol Biol 2, Granada 18071, Spain
[2] Univ Granada, Jose Mataix Verdu Inst Nutr & Food Technol INYTA, Ctr Biomed Res, Granada 18100, Spain
[3] Biosanitary Res Inst Granada IBS GRANADA, Granada 18012, Spain
[4] Ctr Genom & Oncol Res GENYO, Granada 18016, Spain
[5] Univ Granada, Andalusian Res Inst Data Sci & Computat Intelligen, Dept Comp Sci & Artificial Intelligence, Granada 18071, Spain
[6] Barcelona Inst Global Hlth ISGlobal, Barcelona 08003, Spain
[7] Inst Salud Carlos III, CIBER Physiopathol Obes & Nutr CIBEROBN, Madrid 28029, Spain
Keywords
machine learning; omics; data pre-processing; imputation; prediction; models
DOI
10.3390/genes14020248
Chinese Library Classification number
Q3 [Genetics]
Discipline classification code
071007; 090102
Abstract
The use of machine learning techniques to build predictive models of disease outcomes from omics and other molecular data has gained enormous relevance in the biomedical field in the last few years. Nonetheless, the value of omics studies and machine learning tools depends on the proper application of algorithms and on the appropriate pre-processing and management of the input omics and molecular data. Currently, many approaches that apply machine learning to omics data for predictive purposes make mistakes in one or more of the following key steps: experimental design, feature selection, data pre-processing, and algorithm selection. For this reason, we propose the current work as a guideline on how to confront the main challenges inherent to multi-omics human data, and we present a series of best practices and recommendations for each of these steps. In particular, we describe the main particularities of each omics data layer, the most suitable pre-processing approaches for each source, and a compilation of best practices and tips for predicting disease development with machine learning. Using examples of real data, we show how to address the key problems of multi-omics research (e.g., biological heterogeneity, technical noise, high dimensionality, missing values, and class imbalance). Finally, we define proposals for model improvement based on the results found, which serve as the basis for future work.
Number of pages: 16
Related papers
50 items in total
  • [41] Significance and methodology: Preprocessing the big data for machine learning on TBM performance
    Xiao, Hao-Han
    Yang, Wen-Kun
    Hu, Jing
    Zhang, Yun-Pei
    Jing, Liu-Jie
    Chen, Zu-Yu
    UNDERGROUND SPACE, 2022, 7 (04) : 680 - 701
  • [42] Machine learning for precision medicine forecasts and challenges when incorporating non omics and omics data
    Susymary, J.
    Deepalakshmi, P.
    INTELLIGENT DECISION TECHNOLOGIES-NETHERLANDS, 2021, 15 (01) : 69 - 85
  • [43] Machine Learning Models to Predict Childhood and Adolescent Obesity: A Review
    Colmenarejo, Gonzalo
    NUTRIENTS, 2020, 12 (08) : 1 - 31
  • [44] A Survey on Machine and Deep Learning Models for Childhood and Adolescent Obesity
    Siddiqui, Hera
    Rattani, Ajita
    Woods, Nikki K.
    Cure, Laila
    Lewis, Rhonda K.
    Twomey, Janet
    Smith-Campbell, Betty
    Hill, Twyla J.
    IEEE ACCESS, 2021, 9 : 157337 - 157360
  • [45] Machine learning and systems genomics approaches for multi-omics data
    Lin, Eugene
    Lane, Hsien-Yuan
    BIOMARKER RESEARCH, 2017, 5
  • [46] Dealing with dimensionality: the application of machine learning to multi-omics data
    Feldner-Busztin, Dylan
    Nisantzis, Panos Firbas
    Edmunds, Shelley Jane
    Boza, Gergely
    Racimo, Fernando
    Gopalakrishnan, Shyam
    Limborg, Morten Tonsberg
    Lahti, Leo
    de Polavieja, Gonzalo G.
    BIOINFORMATICS, 2023, 39 (02)
  • [47] Microbiome Preprocessing Machine Learning Pipeline
    Jasner, Yoel Y.
    Belogolovski, Anna
    Ben-Itzhak, Meirav
    Koren, Omry
    Louzoun, Yoram
    FRONTIERS IN IMMUNOLOGY, 2021, 12
  • [48] ClinicalomicsDB - Bridging the gap between clinical omics data and machine learning
    Moon, Chang In
    Jia, Byron
    Zhang, Bing
    CANCER RESEARCH, 2023, 83 (05)
  • [49] Integration strategies of multi-omics data for machine learning analysis
    Picard M.
    Scott-Boyer M.-P.
    Bodein A.
    Périn O.
    Droit A.
    Computational and Structural Biotechnology Journal, 2021, 19 : 3735 - 3746
  • [50] Integration strategies of multi-omics data for machine learning analysis
    Picard, Milan
    Scott-Boyer, Marie-Pier
    Bodein, Antoine
    Perin, Olivier
    Droit, Arnaud
    COMPUTATIONAL AND STRUCTURAL BIOTECHNOLOGY JOURNAL, 2021, 19 : 3735 - 3746