Efficient iterative virtual screening with Apache Spark and conformal prediction

Cited by: 30
Authors
Ahmed, Laeeq [1]
Georgiev, Valentin [2]
Capuccini, Marco [2, 3]
Toor, Salman [3]
Schaal, Wesley [2]
Laure, Erwin [1]
Spjuth, Ola [2]
Affiliations
[1] Royal Inst Technol KTH, Dept Computat Sci & Technol, Lindstedtsvagen 5, S-10044 Stockholm, Sweden
[2] Uppsala Univ, Dept Pharmaceut Biosci, Box 591, S-75124 Uppsala, Sweden
[3] Uppsala Univ, Dept Informat Technol, Box 337, S-75105 Uppsala, Sweden
Source
Journal of Cheminformatics | 2018 / Volume 10
Keywords
Virtual screening; Docking; Conformal prediction; Cloud computing; Apache Spark; DRUG DISCOVERY; LARGE-SCALE; BENCHMARKING; DOCKING; QSAR
DOI
10.1186/s13321-018-0265-z
Chinese Library Classification number
O6 [Chemistry]
Discipline classification code
0703
Abstract
Background: Docking and scoring large libraries of ligands against target proteins forms the basis of structure-based virtual screening. The problem is trivially parallelizable, and calculations are generally carried out on computer clusters or on large workstations in a brute-force manner, by docking and scoring all available ligands. Contribution: In this study we propose a strategy based on iteratively docking a set of ligands to form a training set, training a ligand-based model on this set, and predicting the remainder of the ligands in order to exclude those predicted as 'low-scoring'. Then another set of ligands is docked, the model is retrained, and the process is repeated until a certain model efficiency level is reached. Thereafter, the remaining ligands are docked or excluded based on this model. We use SVM and conformal prediction to deliver valid prediction intervals for ranking the predicted ligands, and Apache Spark to parallelize both the docking and the modeling. Results: We show on 4 different targets that conformal-prediction-based virtual screening (CPVS) is able to reduce the number of docked molecules by 62.61% while retaining an accuracy for the top 30 hits of 94% on average and achieving a speedup of 3.7. The implementation is available as open source via GitHub (https://github.com/laeeq80/spark-cpvs) and can be run on high-performance computers as well as on cloud resources.
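The iterative strategy described in the abstract can be sketched in a few lines of single-machine Python. This is only an illustration under stated assumptions, not the spark-cpvs implementation: `dock` is a hypothetical stand-in for a docking program, each ligand is reduced to a single numeric feature, the "high-scoring" class is assumed to be the top 20% of docked ligands, a distance-to-centroid nonconformity measure replaces the SVM, and nothing is parallelized with Apache Spark. The rule applied once the efficiency target is reached (dock only ligands predicted as high, exclude the rest) is likewise an assumption. The batch size should be large enough that both classes appear among the docked ligands.

```python
import random

def dock(feature):
    # Hypothetical stand-in for a docking program; higher means a better score.
    return feature

def split_by_class(labeled):
    """Split (feature, label) pairs into proper-training and calibration parts,
    alternating within each class so both parts can contain both classes."""
    train, calib = [], []
    for label in (0, 1):
        members = [pair for pair in labeled if pair[1] == label]
        train += members[0::2]
        calib += members[1::2]
    return train, calib

def prediction_set(train, calib, feature, significance):
    """Mondrian inductive conformal prediction for labels 0 (low) and 1 (high)."""
    labels = set()
    for label in (0, 1):
        members = [f for f, y in train if y == label]
        centroid = sum(members) / len(members)
        # Nonconformity: distance to the class centroid (toy measure, not an SVM).
        calib_scores = [abs(f - centroid) for f, y in calib if y == label]
        alpha = abs(feature - centroid)
        p = (sum(a >= alpha for a in calib_scores) + 1) / (len(calib_scores) + 1)
        if p > significance:
            labels.add(label)
    return labels

def cpvs(library, batch_size=100, significance=0.2, target_efficiency=0.8, seed=0):
    pool = list(range(len(library)))          # ligand ids not yet docked or excluded
    random.Random(seed).shuffle(pool)
    docked, excluded = {}, set()              # id -> score, ids never docked
    while pool:
        # 1. Dock the next batch and add it to the training data.
        batch, pool = pool[:batch_size], pool[batch_size:]
        for i in batch:
            docked[i] = dock(library[i])
        # 2. Label docked ligands (assumption: top 20% are "high") and retrain.
        ranked = sorted(docked, key=docked.get, reverse=True)
        n_high = max(2, len(ranked) // 5)
        labeled = [(library[i], 1 if rank < n_high else 0)
                   for rank, i in enumerate(ranked)]
        train, calib = split_by_class(labeled)
        # 3. Predict the remaining ligands and exclude those predicted {low}.
        sets = {i: prediction_set(train, calib, library[i], significance)
                for i in pool}
        low = {i for i, s in sets.items() if s == {0}}
        excluded |= low
        pool = [i for i in pool if i not in low]
        # 4. Efficiency: fraction of single-label prediction sets.
        efficiency = sum(len(s) == 1 for s in sets.values()) / max(1, len(sets))
        if efficiency >= target_efficiency:
            # Assumption: dock only ligands predicted {high}, exclude the rest.
            for i in pool:
                if sets[i] == {1}:
                    docked[i] = dock(library[i])
                else:
                    excluded.add(i)
            break
    return docked, excluded

rng = random.Random(1)
library = [rng.random() for _ in range(1000)]
docked, excluded = cpvs(library)
print(f"docked {len(docked)} of {len(library)} ligands, excluded {len(excluded)}")
```

In this sketch every ligand ends up either docked or excluded, so the fraction of docking runs saved is simply `len(excluded) / len(library)`; the figures it prints come from toy data and say nothing about the results reported in the abstract.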
Pages: 8