A comparison of machine learning approaches for predicting hepatotoxicity potential using chemical structure and targeted transcriptomic data

Authors
Tate, Tia [1]
Patlewicz, Grace [1, 2]
Shah, Imran [1]
Affiliations
[1] US EPA, Ctr Computat Toxicol & Exposure CCTE, Durham, NC 27709 USA
[2] US EPA, Ctr Computat Toxicol & Exposure CCTE, 109 TW Alexander Dr, Res Triangle Pk, NC 27711 USA
Keywords
Generalised Read-across (GenRA); High throughput transcriptomics (HTTr); Machine Learning (ML); BIOACTIVITY
DOI
10.1016/j.comtox.2024.100301
Chinese Library Classification number
R99 [Toxicology]
Discipline classification code
100405
Abstract
Animal toxicity testing is time- and resource-intensive, making it difficult to keep pace with the number of substances requiring assessment. Machine learning (ML) models that use chemical structure information and high-throughput experimental data can help predict potential toxicity. However, much of the toxicity data used to train ML models is biased, with an unequal balance of positives and negatives, primarily because substances selected for in vivo testing are expected to elicit some toxic effect. To investigate the impact of this bias on predictive performance, various sampling approaches were used to balance in vivo toxicity data as part of a supervised ML workflow to predict hepatotoxicity outcomes from chemical structure and/or targeted transcriptomic data. From the chronic, subchronic, developmental, multigenerational reproductive, and subacute repeat-dose testing toxicity outcomes with a minimum of 50 positive and 50 negative substances, 18 different study-toxicity outcome combinations were evaluated in up to 7 ML models. These included Artificial Neural Networks, Random Forests, Bernoulli Naive Bayes, Gradient Boosting, and Support Vector classification algorithms, which were compared with a local approach, Generalised Read-Across (GenRA), a similarity-weighted k-Nearest Neighbour (k-NN) method. The mean cross-validation (CV) F1 performance for unbalanced data across all classifiers and descriptors for chronic liver effects was 0.735 (0.0395 SD). Mean CV F1 performance dropped to 0.639 (0.073 SD) with over-sampling approaches, though the poorer performance of k-NN approaches in some cases contributed to the observed decrease (mean CV F1 performance excluding k-NN was 0.697 (0.072 SD)). With under-sampling approaches, the mean CV F1 was 0.523 (0.083 SD). For developmental liver effects, the mean CV F1 performance was much lower, at 0.089 (0.111 SD) for unbalanced approaches and 0.149 (0.084 SD) for under-sampling. Over-sampling approaches led to an increase in mean CV F1 performance (0.234 (0.107 SD)) for developmental liver toxicity. Model performance was found to depend on dataset, model type, balancing approach, and feature selection. Accordingly, tailoring ML workflows for predicting toxicity should take class imbalance into account and start with simple classifiers.
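The balancing step and the F1 metric described in the abstract can be illustrated with a minimal sketch. The abstract does not name the specific sampling methods used, so plain random over-sampling and random under-sampling are assumed here as stand-ins, and the function names (`random_oversample`, `random_undersample`, `f1_score`) are hypothetical helpers written for this illustration, not the authors' code.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class rows until all classes
    match the size of the largest class (assumed stand-in method)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(target - n):
            j = rng.choice(idx)
            X_out.append(X[j])
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class
    (assumed stand-in method)."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, target))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


# Toy data: 8 positives and 2 negatives, mimicking a positive-heavy dataset.
X = [[i] for i in range(10)]
y = [1] * 8 + [0] * 2
X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
```

In a cross-validation setting, resampling of this kind is normally applied to the training folds only, so that the held-out fold keeps the original class balance when F1 is computed.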
Pages: 14