Large-Scale Modeling of Sparse Protein Kinase Activity Data

Cited by: 4
Authors
Luukkonen, Sohvi [1]
Meijer, Erik [1]
Tricarico, Giovanni A. [2]
Hofmans, Johan [2]
Stouten, Pieter F. W. [1, 2, 3]
van Westen, Gerard J. P. [1]
Lenselink, Eelke B. [2]
Affiliations
[1] Leiden Univ, Leiden Acad Ctr Drug Res, NL-2333 CC Leiden, Netherlands
[2] Galapagos NV, B-2800 Mechelen, Belgium
[3] Stouten Pharma Consultancy BV, B-2860 St Katelijne Waver, Belgium
Funding
Dutch Research Council;
Keywords
INHIBITOR; QSAR;
DOI
10.1021/acs.jcim.3c00132
Chinese Library Classification number
R914 [Medicinal chemistry];
Discipline classification code
100701;
Abstract
Protein kinases are a protein family that plays an important role in several complex diseases such as cancer and cardiovascular and immunological diseases. Protein kinases have conserved ATP binding sites, which when targeted can lead to similar activities of inhibitors against different kinases. This can be exploited to create multitarget drugs. On the other hand, selectivity (lack of similar activities) is desirable in order to avoid toxicity issues. There is a vast amount of protein kinase activity data in the public domain, which can be used in many different ways. Multitask machine learning models are expected to excel for these kinds of data sets because they can learn from implicit correlations between tasks (in this case activities against a variety of kinases). However, multitask modeling of sparse data poses two major challenges: (i) creating a balanced train-test split without data leakage and (ii) handling missing data. In this work, we construct a protein kinase benchmark set composed of two balanced splits without data leakage, using random and dissimilarity-driven cluster-based mechanisms, respectively. This data set can be used for benchmarking and developing protein kinase activity prediction models. Overall, the performance on the dissimilarity-driven cluster-based split is lower than on random split-based sets for all models, indicating poor generalizability of models. Nevertheless, we show that multitask deep learning models, on this very sparse data set, outperform single-task deep learning and tree-based models. Finally, we demonstrate that data imputation does not improve the performance of (multitask) models on this benchmark set.
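The two challenges named in the abstract can be illustrated with a minimal sketch. It is not the authors' procedure: the function names `split_by_cluster` and `masked_mse` are hypothetical, the cluster labels are assumed to be given, missing activities are assumed to be encoded as NaN, and no attempt is made to balance the split across individual kinases as the benchmark set does.

```python
import numpy as np


def split_by_cluster(cluster_ids, test_fraction=0.2, seed=0):
    """Assign whole clusters to train or test so that no cluster spans both.

    cluster_ids: one (hypothetical) cluster label per compound.
    Returns boolean masks (train_mask, test_mask) over the compounds.
    """
    cluster_ids = np.asarray(cluster_ids)
    rng = np.random.default_rng(seed)
    clusters = rng.permutation(np.unique(cluster_ids))
    target = test_fraction * len(cluster_ids)
    test_clusters, n_test = [], 0
    for c in clusters:
        # Stop adding clusters once the test set reaches the target size.
        if n_test >= target:
            break
        test_clusters.append(c)
        n_test += int(np.sum(cluster_ids == c))
    test_mask = np.isin(cluster_ids, test_clusters)
    return ~test_mask, test_mask


def masked_mse(y_true, y_pred):
    """Mean squared error over measured entries only.

    y_true: compounds x kinases activity matrix, NaN where no measurement exists.
    Missing entries contribute nothing to the loss instead of being imputed.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    observed = ~np.isnan(y_true)
    if not observed.any():
        raise ValueError("no observed activities to score")
    diff = y_pred[observed] - y_true[observed]
    return float(np.mean(diff ** 2))
```

Keeping each cluster on one side of the split is what prevents close analogues of a test compound from appearing in training, and masking the loss lets a multitask model train on a sparse matrix without filling in unmeasured compound-kinase pairs.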
Pages: 3688 - 3696
Number of pages: 9