Efficient iterative virtual screening with Apache Spark and conformal prediction

Cited by: 30

Authors:

Ahmed, Laeeq [1]

Georgiev, Valentin [2]

Capuccini, Marco [2,3]

Toor, Salman [3]

Schaal, Wesley [2]

Laure, Erwin [1]

Spjuth, Ola [2]

Affiliations:

[1] Royal Inst Technol KTH, Dept Computat Sci & Technol, Lindstedtsvagen 5, S-10044 Stockholm, Sweden

[2] Uppsala Univ, Dept Pharmaceut Biosci, Box 591, S-75124 Uppsala, Sweden

[3] Uppsala Univ, Dept Informat Technol, Box 337, S-75105 Uppsala, Sweden

Source:

JOURNAL OF CHEMINFORMATICS | 2018 / Volume 10

Keywords:

Virtual screening; Docking; Conformal prediction; Cloud computing; Apache Spark; DRUG DISCOVERY; LARGE-SCALE; BENCHMARKING; DOCKING; QSAR;

DOI:

10.1186/s13321-018-0265-z

Chinese Library Classification number:

O6 [Chemistry]

Discipline classification code:

0703

Abstract:

Background: Docking and scoring large libraries of ligands against target proteins forms the basis of structure-based virtual screening. The problem is trivially parallelizable, and calculations are generally carried out on computer clusters or on large workstations in a brute-force manner, by docking and scoring all available ligands. Contribution: In this study we propose a strategy based on iteratively docking a set of ligands to form a training set, training a ligand-based model on this set, and predicting the remainder of the ligands to exclude those predicted as 'low-scoring'. Then another set of ligands is docked, the model is retrained, and the process is repeated until a certain model efficiency level is reached. Thereafter, the remaining ligands are docked or excluded based on this model. We use SVM and conformal prediction to deliver valid prediction intervals for ranking the predicted ligands, and Apache Spark to parallelize both the docking and the modeling. Results: We show on four different targets that conformal-prediction-based virtual screening (CPVS) is able to reduce the number of docked molecules by 62.61% while retaining an accuracy for the top 30 hits of 94% on average and a speedup of 3.7. The implementation is available as open source via GitHub (https://github.com/laeeq80/spark-cpvs) and can be run on high-performance computers as well as on cloud resources.
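
The iterative loop described in the abstract can be illustrated with a small single-machine sketch. This is not the authors' Spark implementation: `dock` is a hypothetical stand-in for a docking program, each ligand is reduced to one made-up numeric feature, a nearest-class-mean scorer replaces the SVM, labels are assigned by a median cutoff on the docked scores, and "model efficiency" is taken here to be the share of single-label prediction sets. All of these are assumptions made only to keep the example short and runnable.

```python
import random
from statistics import mean, median

def dock(feature, rng):
    # Hypothetical stand-in for a docking run; higher score = better ligand.
    return feature + rng.gauss(0, 0.05)

def p_value(alpha, calib_alphas):
    # Conformal p-value: share of calibration examples at least as nonconforming.
    return (sum(a >= alpha for a in calib_alphas) + 1) / (len(calib_alphas) + 1)

def train_conformal(examples, rng):
    # examples: (feature, label) pairs; label 1 = high-scoring, 0 = low-scoring.
    proper, calib = [], []
    for label in (0, 1):  # per-class split: about 70% proper training, 30% calibration
        group = [e for e in examples if e[1] == label]
        rng.shuffle(group)
        cut = len(group) * 7 // 10
        proper += group[:cut]
        calib += group[cut:]
    m0 = mean(x for x, y in proper if y == 0)
    m1 = mean(x for x, y in proper if y == 1)

    def alpha(x, y):
        s = abs(x - m0) - abs(x - m1)  # > 0 when x looks high-scoring
        return -s if y == 1 else s     # nonconformity of x with respect to label y

    # Mondrian (class-conditional) calibration scores.
    calib_alphas = {y: [alpha(x, y) for x, lab in calib if lab == y] for y in (0, 1)}

    def predict_set(x, eps):
        # Keep every label whose p-value exceeds the significance level eps.
        return {y for y in (0, 1) if p_value(alpha(x, y), calib_alphas[y]) > eps}

    return predict_set

def cpvs(features, batch=40, eps=0.2, target_eff=0.8, seed=0):
    rng = random.Random(seed)
    pool = list(range(len(features)))  # ligand ids not yet docked or excluded
    rng.shuffle(pool)
    docked = {}                        # ligand id -> docking score
    while pool:
        for i in pool[:batch]:         # dock the next batch to extend the training set
            docked[i] = dock(features[i], rng)
        pool = pool[batch:]
        cutoff = median(docked.values())
        examples = [(features[i], int(s > cutoff)) for i, s in docked.items()]
        predict_set = train_conformal(examples, rng)
        sets = {i: predict_set(features[i], eps) for i in pool}
        efficiency = sum(len(s) == 1 for s in sets.values()) / len(sets) if sets else 1.0
        if efficiency >= target_eff or len(pool) <= batch:
            for i in pool:             # final pass: dock only ligands that may be high-scoring
                if 1 in sets[i]:
                    docked[i] = dock(features[i], rng)
            break
        pool = [i for i in pool if sets[i] != {0}]  # drop confident 'low-scoring' ligands
    return docked
```

Calling `cpvs(features)` on a list of per-ligand feature values returns the scores of only those ligands that were actually docked; comparing `len(docked)` with `len(features)` gives the fraction of docking runs avoided in this toy setting.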

Pages: 8
