Performance comparison of Dask and Apache Spark on HPC systems for neuroimaging

Cited by: 0
Authors
Dugre, Mathieu [1]
Hayot-Sasson, Valerie [1]
Glatard, Tristan [1]
Affiliations
[1] Concordia Univ, Dept Comp Sci & Software Engn, Montreal, PQ, Canada
Source
Funding
Canada Foundation for Innovation; Natural Sciences and Engineering Research Council of Canada;
Keywords
Big Data; Dask; HPC; neuroimaging; Spark; performance;
DOI
10.1002/cpe.7635
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202; 0835;
Abstract
The general increase in data size and data sharing motivates the adoption of Big Data strategies in several scientific disciplines. However, while several options are available, no particular guidelines exist for selecting a Big Data engine. In this paper, we compare the runtime performance of two popular Big Data engines with Python APIs, Apache Spark and Dask, in processing neuroimaging pipelines. Our experiments use three synthetic neuroimaging applications to process the 606 GB BigBrain image and an actual pipeline to process data from thousands of anatomical images. We benchmark these applications on a dedicated HPC cluster running the Lustre file system while varying the number of nodes, file size, and task duration. Our results show that although there are slight differences between Dask and Spark, the performance of the engines is comparable for data-intensive applications. However, Spark requires more memory than Dask, which can lead to slower runtimes depending on configuration and infrastructure. In general, the limiting factor was the data transfer time. While both engines are suitable for neuroimaging, more effort is needed to reduce the data transfer time and the memory footprint of applications.
Pages: 15