Performance comparison of Dask and Apache Spark on HPC systems for neuroimaging

Authors
Dugre, Mathieu [1]
Hayot-Sasson, Valerie [1]
Glatard, Tristan [1]
Affiliation
[1] Concordia Univ, Dept Comp Sci & Software Engn, Montreal, PQ, Canada
Source
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE | 2023 / Vol. 35 / Issue 21
Funding
Canada Foundation for Innovation; Natural Sciences and Engineering Research Council of Canada
Keywords
Big Data; Dask; HPC; neuroimaging; Spark; performance;
DOI
10.1002/cpe.7635
Chinese Library Classification
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
The general increase in data size and data sharing motivates the adoption of Big Data strategies in several scientific disciplines. However, while several options are available, no particular guidelines exist for selecting a Big Data engine. In this paper, we compare the runtime performance of two popular Big Data engines with Python APIs, Apache Spark and Dask, in processing neuroimaging pipelines. Our experiments use three synthetic neuroimaging applications to process the 606 GB BigBrain image and an actual pipeline to process data from thousands of anatomical images. We benchmark these applications on a dedicated HPC cluster running the Lustre file system while varying the number of nodes, file size, and task duration. Our results show that although there are slight differences between Dask and Spark, the performance of the engines is comparable for data-intensive applications. However, Spark requires more memory than Dask, which can lead to slower runtimes depending on configuration and infrastructure. In general, the limiting factor was the data transfer time. While both engines are suitable for neuroimaging, more effort is needed to reduce the data transfer time and the memory footprint of applications.
Pages: 15