Performance comparison of Dask and Apache Spark on HPC systems for neuroimaging

Authors
Dugre, Mathieu [1]
Hayot-Sasson, Valerie [1]
Glatard, Tristan [1]
Affiliations
[1] Concordia Univ, Dept Comp Sci & Software Engn, Montreal, PQ, Canada
Source
Funding
Natural Sciences and Engineering Research Council of Canada; Canada Foundation for Innovation
Keywords
Big Data; Dask; HPC; neuroimaging; Spark; performance;
DOI
10.1002/cpe.7635
Chinese Library Classification number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
The general increase in data size and data sharing motivates the adoption of Big Data strategies in several scientific disciplines. However, while several options are available, no particular guidelines exist for selecting a Big Data engine. In this paper, we compare the runtime performance of two popular Big Data engines with Python APIs, Apache Spark and Dask, in processing neuroimaging pipelines. Our experiments use three synthetic neuroimaging applications to process the 606 GB BigBrain image and an actual pipeline to process data from thousands of anatomical images. We benchmark these applications on a dedicated HPC cluster running the Lustre file system while varying the number of nodes, the file size, and the task duration. Our results show that although there are slight differences between Dask and Spark, the performance of the engines is comparable for data-intensive applications. However, Spark requires more memory than Dask, which can lead to slower runtimes depending on configuration and infrastructure. In general, the limiting factor was the data transfer time. While both engines are suitable for neuroimaging, more effort is needed to reduce the data transfer time and the memory footprint of applications.
Pages: 15