A comparison of machine learning approaches for predicting hepatotoxicity potential using chemical structure and targeted transcriptomic data

Cited by: 2

Authors:

Tate, Tia [1]

Patlewicz, Grace [1,2]

Shah, Imran [1]

Affiliations:

[1] US EPA, Ctr Computat Toxicol & Exposure CCTE, Durham, NC 27709 USA

[2] US EPA, Ctr Computat Toxicol & Exposure CCTE, 109 TW Alexander Dr, Res Triangle Pk, NC 27711 USA

Source:

COMPUTATIONAL TOXICOLOGY | 2024 / Vol. 29

Keywords:

Generalised Read-across (GenRA); High throughput transcriptomics (HTTr); Machine Learning (ML); BIOACTIVITY;

DOI:

10.1016/j.comtox.2024.100301

Chinese Library Classification (CLC) number:

R99 [Toxicology];

Discipline classification code:

100405 ;

Abstract:

Animal toxicity testing is time- and resource-intensive, making it difficult to keep pace with the number of substances requiring assessment. Machine learning (ML) models that use chemical structure information and high-throughput experimental data can help predict potential toxicity. However, much of the toxicity data used to train ML models is biased, with an unequal balance of positives and negatives, primarily because substances selected for in vivo testing are expected to elicit some toxic effect. To investigate the impact of this bias on predictive performance, various sampling approaches were used to balance in vivo toxicity data as part of a supervised ML workflow to predict hepatotoxicity outcomes from chemical structure and/or targeted transcriptomic data. From the chronic, subchronic, developmental, multigenerational reproductive, and subacute repeat-dose testing toxicity outcomes with a minimum of 50 positive and 50 negative substances, 18 different study-toxicity outcome combinations were evaluated in up to 7 ML models. These included Artificial Neural Networks, Random Forests, Bernoulli Naive Bayes, Gradient Boosting, and Support Vector classification algorithms, which were compared with a local approach, Generalised Read-Across (GenRA), a similarity-weighted k-Nearest Neighbour (k-NN) method. The mean CV F1 performance for unbalanced data across all classifiers and descriptors for chronic liver effects was 0.735 (0.0395 SD). Mean CV F1 performance dropped to 0.639 (0.073 SD) with over-sampling approaches, though the poorer performance of k-NN approaches in some cases contributed to the observed decrease (mean CV F1 performance excluding k-NN was 0.697 (0.072 SD)). With under-sampling approaches, the mean CV F1 was 0.523 (0.083 SD). For developmental liver effects, the mean CV F1 performance was much lower, at 0.089 (0.111 SD) for unbalanced approaches and 0.149 (0.084 SD) for under-sampling. Over-sampling approaches led to an increase in mean CV F1 performance (0.234 (0.107 SD)) for developmental liver toxicity. Model performance was found to depend on dataset, model type, balancing approach and feature selection. Accordingly, tailoring ML workflows for predicting toxicity should consider class imbalance and rely on simple classifiers first.
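
The balancing and scoring steps named in the abstract can be illustrated with a minimal, standard-library Python sketch. It is not the authors' workflow or the GenRA software: the function names, the use of Jaccard similarity on sets of fingerprint bits, the plain random over- and under-sampling, and the 0.5 decision threshold are all assumptions made for illustration.

```python
import random


def jaccard(a, b):
    """Jaccard similarity between two sets of fingerprint bits (assumed metric)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def weighted_knn_predict(query, train, k=3, threshold=0.5):
    """Similarity-weighted k-NN vote over (fingerprint_set, label) pairs.

    The score is sum(sim * label) / sum(sim) over the k most similar
    training substances; the threshold of 0.5 is an assumption.
    """
    scored = [(jaccard(query, fp), label) for fp, label in train]
    neighbours = sorted(scored, key=lambda t: t[0], reverse=True)[:k]
    total = sum(sim for sim, _ in neighbours)
    if total == 0:
        return 0  # no similar neighbours: default to the negative class
    score = sum(sim * label for sim, label in neighbours) / total
    return int(score >= threshold)


def resample(train, mode, rng):
    """Balance classes by random over- or under-sampling.

    Intended for the training split only, so that duplicated or dropped
    records never leak into the held-out fold.
    """
    pos = [row for row in train if row[1] == 1]
    neg = [row for row in train if row[1] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    if not minority:
        return list(train)  # nothing to balance against
    if mode == "over":
        # Duplicate random minority records until both classes are equal.
        extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
        return majority + minority + extra
    if mode == "under":
        # Keep a random subset of the majority class.
        return minority + rng.sample(majority, len(minority))
    return list(train)


def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

In a cross-validation loop, `resample` would be applied to each training fold before calling `weighted_knn_predict` on the held-out substances, with `f1_score` averaged across folds to give a mean and standard deviation comparable in form to the values reported above.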

Pages: 14
