Performance comparison of Dask and Apache Spark on HPC systems for neuroimaging

Cited by: 0

Authors:

Dugre, Mathieu [1]

Hayot-Sasson, Valerie [1]

Glatard, Tristan [1]

Affiliations:

[1] Concordia Univ, Dept Comp Sci & Software Engn, Montreal, PQ, Canada

Source:

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE | 2023 / Volume 35 / Issue 21

Funding:

Canada Foundation for Innovation; Natural Sciences and Engineering Research Council of Canada;

Keywords:

Big Data; Dask; HPC; neuroimaging; Spark; performance;

DOI:

10.1002/cpe.7635

Chinese Library Classification (CLC) number:

TP31 [Computer software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

The general increase in data size and data sharing motivates the adoption of Big Data strategies in several scientific disciplines. However, while several options are available, no particular guidelines exist for selecting a Big Data engine. In this paper, we compare the runtime performance of two popular Big Data engines with Python APIs, Apache Spark and Dask, in processing neuroimaging pipelines. Our experiments use three synthetic neuroimaging applications to process the 606 GB BigBrain image and an actual pipeline to process data from thousands of anatomical images. We benchmark these applications on a dedicated HPC cluster running the Lustre file system while varying the number of nodes, the file size, and the task duration. Our results show that although there are slight differences between Dask and Spark, the performance of the engines is comparable for data-intensive applications. However, Spark requires more memory than Dask, which can lead to slower runtimes depending on configuration and infrastructure. In general, the limiting factor was the data transfer time. While both engines are suitable for neuroimaging, more effort is needed to reduce the data transfer time and the memory footprint of applications.

Pages: 15

50 items in total

[21] Data Processing Performance of Apache Spark on Beowulf Clusters: An Overview
Cluci, Marius-Iulian
Fotache, Marin
Greavu-Serban, Valerica
VISION 2025: EDUCATION EXCELLENCE AND MANAGEMENT OF INNOVATIONS THROUGH SUSTAINABLE ECONOMIC COMPETITIVE ADVANTAGE, 2019, : 12929 - 12938
[22] Performance Prediction for Data-driven Workflows on Apache Spark
Gulino, Andrea
Canakoglu, Arif
Ceri, Stefano
Ardagna, Danilo
2020 IEEE 28TH INTERNATIONAL SYMPOSIUM ON MODELING, ANALYSIS, AND SIMULATION OF COMPUTER AND TELECOMMUNICATION SYSTEMS (MASCOTS 2020), 2020, : 167 - +
[23] Adaptive performance model for dynamic scaling Apache Spark Streaming
Petrov, Max
Butakov, Nikolay
Nasonov, Denis
Melnik, Mikhail
7TH INTERNATIONAL YOUNG SCIENTISTS CONFERENCE ON COMPUTATIONAL SCIENCE, YSC2018, 2018, 136 : 109 - 117
[24] PerTract: Model Extraction and Specification of Big Data Systems for Performance Prediction by the Example of Apache Spark and Hadoop
Kross, Johannes
Krcmar, Helmut
BIG DATA AND COGNITIVE COMPUTING, 2019, 3 (03) : 1 - 24
[25] A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench
Ahmed, N.
Barczak, Andre L. C.
Susnjak, Teo
Rashid, Mohammed A.
JOURNAL OF BIG DATA, 2020, 7 (01)
[26] A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench
N. Ahmed
Andre L. C. Barczak
Teo Susnjak
Mohammed A. Rashid
Journal of Big Data, 7
[27] Performance Evaluation of Machine Learning Algorithms in Apache Spark for Intrusion Detection
Dobson, Anthony
Roy, Kaushik
Yuan, Xiaohong
Xu, Jinsheng
2018 28TH INTERNATIONAL TELECOMMUNICATION NETWORKS AND APPLICATIONS CONFERENCE (ITNAC), 2018, : 374 - 379
[28] Understanding and Optimizing the Performance of Distributed Machine Learning Applications on Apache Spark
Dunner, Celestine
Parnell, Thomas
Atasu, Kubilay
Sifalakis, Manolis
Pozidis, Haralampos
2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2017, : 331 - 338
[29] Is Intel High Performance Analytics Toolkit a good alternative to Apache Spark?
de Carvalho, Rafael Aquino
Goldman, Alfredo
Cavalheiro, Gerson Geraldo H.
2017 IEEE 16TH INTERNATIONAL SYMPOSIUM ON NETWORK COMPUTING AND APPLICATIONS (NCA), 2017, : 171 - 178
[30] A Model Driven Approach towards Improving the Performance of Apache Spark Applications
Wang, Kewen
Khan, Mohammad Maifi Hasan
Nhan Nguyen
Gokhale, Swapna
2019 IEEE INTERNATIONAL SYMPOSIUM ON PERFORMANCE ANALYSIS OF SYSTEMS AND SOFTWARE (ISPASS), 2019, : 233 - 242
