A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench

Cited by: 0
Authors
N. Ahmed
Andre L. C. Barczak
Teo Susnjak
Mohammed A. Rashid
Affiliations
[1] Massey University, School of Natural and Computational Sciences
[2] Massey University, Department of Mechanical and Electrical Engineering
Source
Keywords
HiBench; BigData; Hadoop; MapReduce; Benchmark; Spark;
DOI
Not available
Chinese Library Classification number
Subject classification code
Abstract
Big Data analytics for storing, processing, and analyzing large-scale datasets has become an essential tool for industry. Distributed computing frameworks such as Hadoop and Spark offer efficient solutions for analyzing vast amounts of data. Owing to its application programming interface (API) availability and its performance, Spark has become very popular, even more popular than the MapReduce framework. Both frameworks have more than 150 parameters, and the combination of these parameters has a massive impact on cluster performance. The default parameters let system administrators deploy their applications without much effort and measure the performance of their specific cluster with factory-set values. However, an open question remains: can a new parameter selection improve cluster performance for large datasets? To address it, this study investigates the most impactful parameters, grouped under resource utilization, input splits, and shuffle, and compares the performance of Hadoop and Spark on a cluster implemented in our laboratory. We tuned these parameters by trial and error over a large number of experiments. For the comparative analysis of the two frameworks, we selected two workloads: WordCount and TeraSort. Performance is measured using three criteria: execution time, throughput, and speedup. Our experimental results reveal that the performance of both systems depends heavily on input data size and correct parameter selection. The analysis of the results shows that Spark performs better than Hadoop when data sets are small, achieving up to two times speedup in WordCount workloads and up to 14 times in TeraSort workloads when the default parameter values are reconfigured.
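The abstract names three criteria (execution time, throughput, and speedup) without defining them. The following is a minimal sketch of how such metrics are commonly derived from measured run times, assuming throughput is input size divided by execution time and speedup is the ratio of Hadoop's execution time to Spark's. These definitions, the function names, and the sample timings are illustrative assumptions, not figures or formulas taken from the paper.

```python
# Sketch of the three criteria named in the abstract.
# ASSUMPTION: throughput = input size / execution time,
#             speedup    = baseline time / candidate time.
# All numbers below are hypothetical, not results from the study.


def throughput_mb_per_s(input_size_mb: float, execution_time_s: float) -> float:
    """Return data processed per second for one benchmark run."""
    if execution_time_s <= 0:
        raise ValueError("execution time must be positive")
    return input_size_mb / execution_time_s


def speedup(baseline_time_s: float, candidate_time_s: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    if baseline_time_s <= 0 or candidate_time_s <= 0:
        raise ValueError("execution times must be positive")
    return baseline_time_s / candidate_time_s


# Hypothetical measurements for one workload on the same input.
input_size_mb = 1024.0
hadoop_time_s = 120.0
spark_time_s = 60.0

print("Hadoop throughput (MB/s):", throughput_mb_per_s(input_size_mb, hadoop_time_s))
print("Spark throughput (MB/s):", throughput_mb_per_s(input_size_mb, spark_time_s))
print("Spark speedup over Hadoop:", speedup(hadoop_time_s, spark_time_s))
```

Under these assumed definitions, a speedup above 1 means Spark finished the workload faster than Hadoop on the same input.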
Related papers
50 in total
  • [1] A comprehensive performance analysis of Apache Hadoop and Apache Spark for large scale data sets using HiBench
    Ahmed, N.
    Barczak, Andre L. C.
    Susnjak, Teo
    Rashid, Mohammed A.
    [J]. JOURNAL OF BIG DATA, 2020, 7 (01)
  • [2] Performance Comparison of Apache Hadoop and Apache Spark
    Singh, Amritpal
    Khamparia, Aditya
    Luhach, Ashish Kr
    [J]. PROCEEDINGS OF THE THIRD INTERNATIONAL CONFERENCE ON ADVANCED INFORMATICS FOR COMPUTING RESEARCH (ICAICR '19), 2019,
  • [3] Performance Evaluation of Query Plan Recommendation with Apache Hadoop and Apache Spark
    Azhir, Elham
    Hosseinzadeh, Mehdi
    Khan, Faheem
    Mosavi, Amir
    [J]. MATHEMATICS, 2022, 10 (19)
  • [4] On the Usability of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science
    Akil, Bilal
    Zhou, Ying
    Roehm, Uwe
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2017, : 303 - 310
  • [5] Large Scale Distributed Data Science using Apache Spark
    Shanahan, James G.
    Dai, Liang
    [J]. KDD'15: PROCEEDINGS OF THE 21ST ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING, 2015, : 2323 - 2324
  • [6] Performance evaluation of cloud-based log file analysis with Apache Hadoop and Apache Spark
    Mavridis, Ilias
    Karatza, Helen
    [J]. JOURNAL OF SYSTEMS AND SOFTWARE, 2017, 125 : 133 - 151
  • [7] Apache Spark and Apache Ignite Performance Analysis
    Stan, Cristiana-Stefania
    Pandelica, Adrian-Eduard
    Zamfir, Vlad-Andrei
    Stan, Roxana Gabriela
    Negru, Catalin
    [J]. 2019 22ND INTERNATIONAL CONFERENCE ON CONTROL SYSTEMS AND COMPUTER SCIENCE (CSCS), 2019, : 726 - 733
  • [8] Processing large-scale data with Apache Spark
    Ko, Seyoon
    Won, Joong-Ho
    [J]. KOREAN JOURNAL OF APPLIED STATISTICS, 2016, 29 (06) : 1077 - 1094
  • [9] Large-Scale Data Pollution with Apache Spark
    Hildebrandt, Kai
    Panse, Fabian
    Wilcke, Niklas
    Ritter, Norbert
    [J]. IEEE TRANSACTIONS ON BIG DATA, 2020, 6 (02) : 396 - 411
  • [10] Filter Large-scale Engine Data using Apache Spark
    Pirozzi, Donato
    Scarano, Vittorio
    Begg, Steven
    De Sercey, Guillaume
    Fish, Andrew
    Harvey, Andrew
    [J]. 2016 IEEE 14TH INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS (INDIN), 2016, : 1300 - 1305