Benchmark Dataset for Training Machine Learning Models to Predict the Pathway Involvement of Metabolites

Cited by: 1
Authors
Huckvale, Erik D. [1 ,2 ]
Powell, Christian D. [1 ,2 ,3 ]
Jin, Huan [4 ]
Moseley, Hunter N. B. [1 ,2 ,4 ,5 ,6 ]
Affiliations
[1] Univ Kentucky, Markey Canc Ctr, Lexington, KY 40506 USA
[2] Univ Kentucky, Superfund Res Ctr, Lexington, KY 40506 USA
[3] Univ Kentucky, Dept Comp Sci, Data Sci Program, Lexington, KY 40506 USA
[4] Univ Kentucky, Dept Toxicol & Canc Biol, Lexington, KY 40536 USA
[5] Univ Kentucky, Dept Mol & Cellular Biochem, Lexington, KY 40506 USA
[6] Univ Kentucky, Inst Biomed Informat, Lexington, KY 40506 USA
Funding
National Science Foundation (USA);
Keywords
metabolite; pathway; machine learning; KEGG; kegg_pull; md_harmonize; atom color; KNOWLEDGEBASE;
DOI
10.3390/metabo13111120
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Subject classification codes
071010 ; 081704 ;
Abstract
Metabolic pathways are a human-defined grouping of life-sustaining biochemical reactions, with metabolites being both the reactants and products of these reactions. However, many public datasets include identified metabolites whose pathway involvement is unknown, hindering metabolic interpretation. To address this shortcoming, various machine learning models, including those trained on data from the Kyoto Encyclopedia of Genes and Genomes (KEGG), have been developed to predict the pathway involvement of metabolites based on their chemical descriptions; however, these prior models are based on old KEGG-based metabolite datasets, including one benchmark dataset that is invalid due to the presence of over 1500 duplicate entries. Therefore, we have developed a new benchmark dataset derived from KEGG following optimal standards of scientific computational reproducibility and including all source code needed to update the benchmark dataset as KEGG changes. We have used this new benchmark dataset with our atom coloring methodology to develop and compare the performance of Random Forest, XGBoost, and multilayer perceptron with autoencoder models. The best overall weighted average performance across 1000 unique folds was an F1 score of 0.8180 and a Matthews correlation coefficient of 0.7933, achieved by XGBoost binary classification models for 11 KEGG-defined pathway categories.
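The two metrics named in the abstract, the F1 score and the Matthews correlation coefficient (MCC), can both be computed from the confusion counts of a binary classifier. The following is a minimal sketch in plain Python, not the authors' evaluation code. The category names and counts are hypothetical, and weighting each category by its number of positive entries is an assumption about how a "weighted average" might be formed.

```python
import math


def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); returns 0.0 when undefined."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 when undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def weighted_average(values, weights):
    """Average of values, each weighted by the matching entry in weights."""
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical confusion counts for two pathway categories
# (illustrative only, not data from the paper).
counts = {
    "category_a": {"tp": 80, "tn": 900, "fp": 10, "fn": 10},
    "category_b": {"tp": 30, "tn": 950, "fp": 5, "fn": 15},
}

f1_values = [f1_score(c["tp"], c["fp"], c["fn"]) for c in counts.values()]
mcc_values = [mcc(c["tp"], c["tn"], c["fp"], c["fn"]) for c in counts.values()]

# Assumption: weight each category by its number of positive entries.
weights = [c["tp"] + c["fn"] for c in counts.values()]

print("weighted F1 :", round(weighted_average(f1_values, weights), 4))
print("weighted MCC:", round(weighted_average(mcc_values, weights), 4))
```

Unlike F1, MCC also uses the true-negative count, which makes it informative when one class is much rarer than the other, as is common when each pathway category is treated as a separate binary problem.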
Pages: 24