On the Usability of Hadoop MapReduce, Apache Spark & Apache Flink for Data Science

Cited by: 0
Authors
Akil, Bilal [1]
Zhou, Ying [1]
Roehm, Uwe [1]
Affiliations
[1] Univ Sydney, Sydney, NSW, Australia
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Distributed data processing platforms for cloud computing are important tools for large-scale data analytics. Apache Hadoop MapReduce has become the de facto standard in this space, though its programming interface is relatively low-level, requiring many implementation steps even for simple analysis tasks. This has led to the development of more advanced dataflow-oriented platforms, most prominently Apache Spark and Apache Flink. These platforms not only aim to improve performance through better in-memory processing, but also provide built-in high-level data processing functionality, such as filtering and join operators, which should make data analysis tasks easier to develop than with plain Hadoop MapReduce. But is this indeed the case? This paper compares three prominent distributed data processing platforms (Apache Hadoop MapReduce, Apache Spark, and Apache Flink) from a usability perspective. We report on the design, execution, and results of a usability study with a cohort of master's students, who learned and worked with all three platforms in order to solve different use cases set in a data science context. Our findings show that participants preferred Spark and Flink over MapReduce. Among participants, there was no significant difference in perceived preference or development time between Spark and Flink as platforms for batch-oriented big data analysis. This study starts an exploration of the factors that make Big Data platforms more or less effective for users in data science.
Pages: 303-310
Page count: 8