On the Usability of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science

Authors
Akil, Bilal [1]
Zhou, Ying [1]
Roehm, Uwe [1]
Affiliations
[1] Univ Sydney, Sydney, NSW, Australia
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level, requiring many implementation steps even for simple analysis tasks. This has led to the development of more advanced dataflow-oriented platforms, most prominently Apache Spark and Apache Flink. These platforms not only aim to improve performance through better in-memory processing, but also provide built-in high-level data processing functionality, such as filtering and join operators, which should make data analysis tasks easier to develop than with plain Hadoop MapReduce. But is this indeed the case? This paper compares three prominent distributed data processing platforms, Apache Hadoop MapReduce, Apache Spark, and Apache Flink, from a usability perspective. We report on the design, execution and results of a usability study with a cohort of master's students, who learned and worked with all three platforms in order to solve different use cases set in a data science context. Our findings show that participants preferred Spark and Flink over MapReduce. Among participants, there was no significant difference in perceived preference or development time between Spark and Flink as platforms for batch-oriented big data analysis. This study starts an exploration of the factors that make Big Data platforms more, or less, effective for users in data science.
Pages: 303-310
Number of pages: 8