Performance comparison of Dask and Apache Spark on HPC systems for neuroimaging

Authors
Dugre, Mathieu [1]
Hayot-Sasson, Valerie [1]
Glatard, Tristan [1]
Affiliation
[1] Concordia Univ, Dept Comp Sci & Software Engn, Montreal, PQ, Canada
Source
CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE | 2023 / Vol. 35 / Issue 21
Funding
Canada Foundation for Innovation; Natural Sciences and Engineering Research Council of Canada
Keywords
Big Data; Dask; HPC; neuroimaging; Spark; performance
DOI
10.1002/cpe.7635
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Subject classification code
081202; 0835
Abstract
The general increase in data size and data sharing motivates the adoption of Big Data strategies in several scientific disciplines. However, while several options are available, no particular guidelines exist for selecting a Big Data engine. In this paper, we compare the runtime performance of two popular Big Data engines with Python APIs, Apache Spark and Dask, in processing neuroimaging pipelines. Our experiments use three synthetic neuroimaging applications to process the 606 GB BigBrain image and an actual pipeline to process data from thousands of anatomical images. We benchmark these applications on a dedicated HPC cluster running the Lustre file system while varying the number of nodes, file size, and task duration. Our results show that although there are slight differences between Dask and Spark, the performance of the engines is comparable for data-intensive applications. However, Spark requires more memory than Dask, which can lead to slower runtimes depending on configuration and infrastructure. In general, the limiting factor was the data transfer time. While both engines are suitable for neuroimaging, more effort is needed to reduce the data transfer time and the memory footprint of applications.
Pages: 15