Large-Scale Modeling of Sparse Protein Kinase Activity Data

Cited by: 4
Authors
Luukkonen, Sohvi [1]
Meijer, Erik [1]
Tricarico, Giovanni A. [2]
Hofmans, Johan [2]
Stouten, Pieter F. W. [1, 2, 3]
van Westen, Gerard J. P. [1]
Lenselink, Eelke B. [2]
Affiliations
[1] Leiden Univ, Leiden Acad Ctr Drug Res, NL-2333 CC Leiden, Netherlands
[2] Galapagos NV, B-2800 Mechelen, Belgium
[3] Stouten Pharma Consultancy BV, B-2860 St Katelijne Waver, Belgium
Funding
Dutch Research Council;
Keywords
INHIBITOR; QSAR;
DOI
10.1021/acs.jcim.3c00132
Chinese Library Classification number
R914 [Medicinal Chemistry];
Discipline classification code
100701;
Abstract
Protein kinases are a protein family that plays an important role in several complex diseases such as cancer and cardiovascular and immunological diseases. Protein kinases have conserved ATP binding sites, which when targeted can lead to similar activities of inhibitors against different kinases. This can be exploited to create multitarget drugs. On the other hand, selectivity (lack of similar activities) is desirable in order to avoid toxicity issues. There is a vast amount of protein kinase activity data in the public domain, which can be used in many different ways. Multitask machine learning models are expected to excel for these kinds of data sets because they can learn from implicit correlations between tasks (in this case activities against a variety of kinases). However, multitask modeling of sparse data poses two major challenges: (i) creating a balanced train-test split without data leakage and (ii) handling missing data. In this work, we construct a protein kinase benchmark set composed of two balanced splits without data leakage, using random and dissimilarity-driven cluster-based mechanisms, respectively. This data set can be used for benchmarking and developing protein kinase activity prediction models. Overall, the performance on the dissimilarity-driven cluster-based split is lower than on random split-based sets for all models, indicating poor generalizability of models. Nevertheless, we show that multitask deep learning models, on this very sparse data set, outperform single-task deep learning and tree-based models. Finally, we demonstrate that data imputation does not improve the performance of (multitask) models on this benchmark set.
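As an illustration of challenge (ii), one common way to train a multitask model on a sparse activity matrix is to compute the loss only over the entries that were actually measured. The sketch below is a generic example of that idea and is not taken from the paper; the function name `masked_mse`, the use of NaN to mark missing labels, and the toy values are all assumptions made for illustration.

```python
import numpy as np

def masked_mse(y_true, y_pred):
    """Mean squared error over observed entries only.

    Assumption: a missing label is encoded as NaN in y_true.
    """
    observed = ~np.isnan(y_true)
    if not observed.any():
        raise ValueError("y_true contains no observed labels")
    # Zero out the error wherever the label is missing so it cannot
    # contribute to the loss (or, in a real model, to the gradient).
    diff = np.where(observed, y_pred - y_true, 0.0)
    return float((diff ** 2).sum() / observed.sum())

# Hypothetical toy matrix: 3 compounds (rows) x 2 kinases (columns).
y_true = np.array([[6.0, np.nan],
                   [np.nan, 7.0],
                   [5.0, 8.0]])
y_pred = np.array([[6.5, 9.0],
                   [4.0, 7.0],
                   [5.0, 7.0]])

loss = masked_mse(y_true, y_pred)
print(loss)
```

Only the four measured entries enter the average here, so predictions for unmeasured compound-kinase pairs neither help nor hurt the score. Imputation, the alternative the abstract evaluates, would instead fill those NaN entries with estimated values before training.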
Pages: 3688-3696
Page count: 9