Scaling Machine Learning for Target Prediction in Drug Discovery using Apache Spark

Cited by: 5
Authors
Harnie, Dries [1, 3]
Vapirev, Alexander E. [2, 3]
Wegner, Jorg Kurt [2]
Gedich, Andrey [6]
Steijaert, Marvin [7]
Wuyts, Roel [3, 4, 5]
De Meuter, Wolfgang [1]
Affiliations
[1] Vrije Univ Brussel, Software Languages Lab, Pl Laan 2, B-1050 Brussels, Belgium
[2] Janssen Pharmaceut, B-2340 Beerse, Belgium
[3] ExaSci Life Lab, B-3001 Leuven, Belgium
[4] IMEC, B-3001 Leuven, Belgium
[5] Katholieke Univ Leuven, DistriNet, B-3001 Leuven, Belgium
[6] ARCADIA Inc, Rostra Business Ctr, St Petersburg 195112, Russia
[7] OpenAnalytics, B-2220 Heist Op Den Berg, Belgium
Keywords
IDENTIFICATION; TOOL
DOI
10.1109/CCGrid.2015.50
Chinese Library Classification
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
In the context of drug discovery, a key problem is the identification of candidate molecules that affect proteins associated with diseases. Inside Janssen Pharmaceutica, the Chemogenomics project aims to derive new candidates from existing experiments through a set of machine learning predictor programs, written in single-node C++. These programs take a long time to run and are inherently parallel, but do not use multiple nodes. We show how we reimplemented the pipeline using Apache Spark, which enabled us to lift the existing programs to a multi-node cluster without making changes to the predictors. We have benchmarked our Spark pipeline against the original, which shows almost linear speedup up to 8 nodes. In addition, our pipeline generates fewer intermediate files while allowing easier checkpointing and monitoring.
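The idea of lifting unmodified single-node programs into a parallel pipeline can be sketched as follows. This is not the authors' code and it does not use Spark: it is a single-machine analogue built on the Python standard library, in which each partition of the input is streamed through an external command over stdin/stdout. Spark offers a comparable mechanism in RDD.pipe, but the abstract does not say which mechanism the authors used. The predictor command, the function names, and the upper-casing "prediction" are hypothetical stand-ins chosen only so the sketch runs anywhere.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for an unmodified single-node predictor binary.
# Assumption: it reads one record per line on stdin and writes one result
# per line on stdout. Here it simply upper-cases each record.
PREDICTOR_CMD = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    print(line.strip().upper())",
]


def split(records, n_partitions):
    """Split records into at most n_partitions contiguous, ordered chunks."""
    if not records:
        return []
    size = -(-len(records) // n_partitions)  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def run_partition(records):
    """Feed one partition to the external program and return its output lines."""
    completed = subprocess.run(
        PREDICTOR_CMD,
        input="\n".join(records) + "\n",
        capture_output=True,
        text=True,
        check=True,  # raise if the external program exits with an error
    )
    return completed.stdout.splitlines()


def run_pipeline(records, n_partitions=4):
    """Run the external predictor over all partitions concurrently."""
    partitions = split(records, n_partitions)
    if not partitions:
        return []
    # Threads are sufficient here: the work happens in the child processes.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        results = pool.map(run_partition, partitions)  # preserves partition order
    return [line for part in results for line in part]
```

Because the external program is only ever addressed through its command line and standard streams, it needs no source changes. A cluster framework would replace the thread pool with tasks scheduled across nodes.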
Pages: 871-879
Number of pages: 9