Large-scale virtual screening on public cloud resources with Apache Spark

Cited by: 14
Authors
Capuccini, Marco [1, 2]
Ahmed, Laeeq [3]
Schaal, Wesley [2]
Laure, Erwin [3]
Spjuth, Ola [2]
Affiliations
[1] Uppsala Univ, Dept Informat Technol, Box 337, S-75105 Uppsala, Sweden
[2] Uppsala Univ, Dept Pharmaceut Biosci, Box 591, S-75124 Uppsala, Sweden
[3] Royal Inst Technol KTH, Dept Computat Sci & Technol, Lindstedtsvagen 5, S-10044 Stockholm, Sweden
Source
Keywords
Virtual screening; Docking; Cloud computing; Apache Spark; MapReduce
DOI
10.1186/s13321-017-0204-4
Chinese Library Classification
O6 [Chemistry]
Discipline classification code
0703
Abstract
Background: Structure-based virtual screening is an in silico method to screen a target receptor against a virtual molecular library. Applying docking-based screening to large molecular libraries can be computationally expensive, but it is a trivially parallelizable task. Most of the available parallel implementations are based on the Message Passing Interface (MPI), relying on hardware with a low failure rate and fast network connections. Google's MapReduce revolutionized large-scale analysis, enabling the processing of massive datasets on commodity hardware and cloud resources, and providing transparent scalability and fault tolerance at the software level. Open source implementations of MapReduce include Apache Hadoop and the more recent Apache Spark. Results: We developed a method to run existing docking-based screening software on distributed cloud resources, utilizing the MapReduce approach. We benchmarked our method, which is implemented in Apache Spark, by docking a publicly available target receptor against ~2.2 M compounds. The performance experiments show good parallel efficiency (87%) when running in a public cloud environment. Conclusion: Our method enables parallel structure-based virtual screening on public cloud resources or commodity computer clusters. The degree of scalability that we achieve allows users to try our method on relatively small libraries first and then scale to larger libraries.
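Illustrative sketch (not from the paper): the abstract describes docking as a trivially parallelizable task handled with the MapReduce approach. The plain-Python example below shows only the general shape of that pattern: a molecular library is split into partitions, each partition is scored independently (map), and the partial results are merged into a global top-k list (reduce). It does not use Apache Spark or any real docking software. The names mock_dock, partition, map_partition, merge_top and screen are hypothetical, and the "score" is a toy placeholder, on the assumption that lower scores are better.

```python
from functools import reduce
from typing import List, Tuple

Scored = Tuple[str, float]


def mock_dock(molecule: str) -> Scored:
    """Hypothetical stand-in for a docking program (assumption).

    Returns a toy score equal to minus the string length; this is NOT a
    real docking score. Lower is treated as better.
    """
    return molecule, -float(len(molecule))


def partition(library: List[str], n_parts: int) -> List[List[str]]:
    """Split the library into n_parts chunks in round-robin order."""
    return [library[i::n_parts] for i in range(n_parts)]


def map_partition(chunk: List[str], top_k: int) -> List[Scored]:
    """Map step: score every molecule in one chunk, keep the chunk's best top_k."""
    scored = [mock_dock(m) for m in chunk]
    return sorted(scored, key=lambda pair: pair[1])[:top_k]


def merge_top(a: List[Scored], b: List[Scored], top_k: int) -> List[Scored]:
    """Reduce step: merge two partial result lists, keep the best top_k."""
    return sorted(a + b, key=lambda pair: pair[1])[:top_k]


def screen(library: List[str], n_parts: int = 4, top_k: int = 3) -> List[Scored]:
    """Run the map step per chunk, then fold the partial results together.

    Chunks are processed sequentially here; in a distributed framework each
    chunk could be handled by a different worker because no chunk depends
    on another.
    """
    partials = [map_partition(chunk, top_k) for chunk in partition(library, n_parts)]
    return reduce(lambda a, b: merge_top(a, b, top_k), partials, [])


if __name__ == "__main__":
    toy_library = ["C", "CC", "CCC", "CCCC", "CCCCC", "CCCCCC"]
    for molecule, score in screen(toy_library, n_parts=2, top_k=2):
        print(molecule, score)
```

Because each chunk is scored without reference to any other, a failed chunk can simply be recomputed, which is the property that lets a MapReduce framework provide fault tolerance at the software level.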
Pages: 6
Related papers
50 in total
  • [1] Large-scale virtual screening on public cloud resources with Apache Spark
    Marco Capuccini
    Laeeq Ahmed
    Wesley Schaal
    Erwin Laure
    Ola Spjuth
    Journal of Cheminformatics, 9
  • [2] Ensemble Learning for Large Scale Virtual Screening on Apache Spark
    Sid, Karima
    Batouche, Mohamed
    COMPUTATIONAL INTELLIGENCE AND ITS APPLICATIONS, 2018, 522 : 244 - 256
  • [3] Large-Scale Data Pollution with Apache Spark
    Hildebrandt, Kai
    Panse, Fabian
    Wilcke, Niklas
    Ritter, Norbert
    IEEE TRANSACTIONS ON BIG DATA, 2020, 6 (02) : 396 - 411
  • [4] Processing large-scale data with Apache Spark
    Ko, Seyoon
    Won, Joong-Ho
    KOREAN JOURNAL OF APPLIED STATISTICS, 2016, 29 (06) : 1077 - 1094
  • [5] Large-scale virtual screening experiments on Windows Azure-based cloud resources
    Kiss, Tamas
    Borsody, Peter
    Terstyanszky, Gabor
    Winter, Stephen
    Greenwell, Pamela
    McEldowney, Sharron
    Heindl, Hans
    CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2014, 26 (10) : 1760 - 1770
  • [6] Large-Scale Network Embedding in Apache Spark
    Lin, Wenqing
    KDD '21: PROCEEDINGS OF THE 27TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2021, : 3271 - 3279
  • [7] Large-scale text processing pipeline with Apache Spark
    Svyatkovskiy, A.
    Imai, K.
    Kroeger, M.
    Shiraito, Y.
    2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2016, : 3928 - 3935
  • [8] A Cloud-Based Framework for Large-Scale Log Mining through Apache Spark and Elasticsearch
    Li, Yun
    Jiang, Yongyao
    Gu, Juan
    Lu, Mingyue
    Yu, Manzhu
    Armstrong, Edward M.
    Huang, Thomas
    Moroni, David
    McGibbney, Lewis J.
    Frank, Greguska
    Yang, Chaowei
    APPLIED SCIENCES-BASEL, 2019, 9 (06):
  • [9] GeoMatch: Efficient Large-Scale Map Matching on Apache Spark
    Zeidan, Ayman
    Lagerspetz, Eemil
    Zhao, Kai
    Nurmi, Petteri
    Tarkoma, Sasu
    Vo, Huy T.
    2018 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2018, : 384 - 391
  • [10] Filter Large-scale Engine Data using Apache Spark
    Pirozzi, Donato
    Scarano, Vittorio
    Begg, Steven
    De Sercey, Guillaume
    Fish, Andrew
    Harvey, Andrew
    2016 IEEE 14TH INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS (INDIN), 2016, : 1300 - 1305