Large-scale virtual screening on public cloud resources with Apache Spark

Cited by: 14

Authors:

Capuccini, Marco [1,2]

Ahmed, Laeeq [3]

Schaal, Wesley [2]

Laure, Erwin [3]

Spjuth, Ola [2]

Affiliations:

[1] Uppsala Univ, Dept Informat Technol, Box 337, S-75105 Uppsala, Sweden

[2] Uppsala Univ, Dept Pharmaceut Biosci, Box 591, S-75124 Uppsala, Sweden

[3] Royal Inst Technol KTH, Dept Computat Sci & Technol, Lindstedtsvagen 5, S-10044 Stockholm, Sweden

Source:

JOURNAL OF CHEMINFORMATICS | 2017 / Volume 9

Keywords:

Virtual screening; Docking; Cloud computing; Apache Spark; MapReduce

DOI:

10.1186/s13321-017-0204-4

Chinese Library Classification number:

O6 [Chemistry]

Discipline classification code:

0703

Abstract:

Background: Structure-based virtual screening is an in silico method to screen a target receptor against a virtual molecular library. Applying docking-based screening to large molecular libraries can be computationally expensive; however, it constitutes a trivially parallelizable task. Most of the available parallel implementations are based on the Message Passing Interface (MPI), relying on low-failure-rate hardware and fast network connections. Google's MapReduce revolutionized large-scale analysis, enabling the processing of massive datasets on commodity hardware and cloud resources, and providing transparent scalability and fault tolerance at the software level. Open-source implementations of MapReduce include Apache Hadoop and the more recent Apache Spark.

Results: We developed a method to run existing docking-based screening software on distributed cloud resources, utilizing the MapReduce approach. We benchmarked our method, which is implemented in Apache Spark, by docking a publicly available target receptor against ~2.2 M compounds. The performance experiments show good parallel efficiency (87%) when running in a public cloud environment.

Conclusion: Our method enables parallel structure-based virtual screening on public cloud resources or commodity computer clusters. The degree of scalability that we achieve allows for trying out our method on relatively small libraries first and then scaling to larger libraries.
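
The MapReduce pattern described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation and does not use Apache Spark: it is plain Python, and `mock_dock`, `TOP_K`, the round-robin partitioning, and the "higher score is better" convention are all hypothetical choices made only so the example runs without docking software. The map step scores each partition of the library independently, and the reduce step merges partial top-K lists, which is roughly how a partition-wise map followed by a reduce would be arranged on a cluster.

```python
from functools import reduce
import heapq

TOP_K = 3  # number of best-scoring molecules to keep (illustrative choice)


def mock_dock(molecule: str) -> float:
    """Hypothetical stand-in for an external docking program.

    Returns a deterministic fake score so the sketch runs on its own;
    a real pipeline would invoke the docking software here.
    """
    return float(sum(ord(ch) for ch in molecule) % 100)


def partition(items, n_parts):
    """Split the library into n_parts chunks (round-robin assignment)."""
    return [items[i::n_parts] for i in range(n_parts)]


def map_partition(molecules):
    """Map step: score every molecule in one partition, keep the local top K."""
    scored = [(mock_dock(m), m) for m in molecules]
    return heapq.nlargest(TOP_K, scored)


def merge_top(a, b):
    """Reduce step: merge two partial top-K lists into a single top-K list."""
    return heapq.nlargest(TOP_K, a + b)


def screen(library, n_parts=4):
    """Run the map step per partition, then fold the partial results together."""
    partials = map(map_partition, partition(library, n_parts))
    return reduce(merge_top, partials, [])


# Tiny made-up library of SMILES strings, for illustration only.
library = ["CCO", "c1ccccc1", "CC(=O)O", "CCN", "CCCC", "O=C=O", "CN", "CCOC"]
top_hits = screen(library)
print(top_hits)
```

Because each partition is scored independently and the merge is associative, the result does not depend on how many partitions are used; that independence is what makes the task trivially parallelizable.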

Pages: 6
