A Survey on Spark Ecosystem: Big Data Processing Infrastructure, Machine Learning, and Applications

Cited by: 30
Authors
Tang, Shanjiang [1]
He, Bingsheng [2]
Yu, Ce [1]
Li, Yusen [3]
Li, Kun [1]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin 300072, Peoples R China
[2] Natl Univ Singapore, Sch Comp, Singapore 119077, Singapore
[3] Nankai Univ, Sch Comp, Tianjin 300071, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Spark; Shark; RDD; in-memory data processing; DATA PROVENANCE SUPPORT; DATA-MANAGEMENT
DOI
10.1109/TKDE.2020.2975652
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the explosive growth of big data in industry and academia, it is important to apply large-scale data processing systems to analyze it. Arguably, Spark is the state of the art among large-scale data computing systems today, owing to its generality, fault tolerance, high-performance in-memory data processing, and scalability. Spark adopts a flexible Resilient Distributed Dataset (RDD) programming model with a set of provided transformation and action operators whose operating functions can be customized by users according to their applications. It was originally positioned as a fast and general data processing system. Since its introduction, a large body of research has sought to make it more efficient (faster) and more general under various circumstances. In this survey, we thoroughly review optimization techniques for improving the generality and performance of Spark. We introduce the Spark programming model and computing system, discuss the pros and cons of Spark, and investigate and classify the solution techniques in the literature. We also introduce the data management and processing systems, machine learning algorithms, and applications supported by Spark. Finally, we discuss the open issues and challenges for large-scale in-memory data processing with Spark.
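The abstract's description of transformation and action operators with user-supplied functions can be illustrated with a small sketch. This is a toy, single-machine stand-in written in plain Python, not Spark's API or implementation: the class name `ToyRDD` and its internals are hypothetical, and it omits partitioning, distribution, and fault tolerance. It only shows the general idea that transformations are recorded lazily and run when an action is called.

```python
from functools import reduce as _reduce


class ToyRDD:
    """Hypothetical toy dataset: transformations are lazy, actions trigger work."""

    def __init__(self, data, ops=()):
        self._data = list(data)
        self._ops = tuple(ops)  # pending transformations, not yet executed

    # Transformations: return a new ToyRDD and do no computation yet.
    def map(self, fn):
        return ToyRDD(self._data, self._ops + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._data, self._ops + (("filter", pred),))

    def _compute(self):
        # Replay the recorded transformations over the source data.
        items = iter(self._data)
        for kind, fn in self._ops:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: run the pipeline and return a result to the caller.
    def collect(self):
        return list(self._compute())

    def reduce(self, fn):
        return _reduce(fn, self._compute())


# User-defined functions customize each operator.
squares_of_evens = (
    ToyRDD(range(1, 7))
    .filter(lambda x: x % 2 == 0)
    .map(lambda x: x * x)
)
print(squares_of_evens.collect())                   # [4, 16, 36]
print(squares_of_evens.reduce(lambda a, b: a + b))  # 56
```

In this sketch nothing is computed when `filter` and `map` are chained; the work happens only inside `collect` and `reduce`.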
Pages: 71-91
Page count: 21
Related papers
50 in total
  • [21] Big data execution time based on Spark Machine Learning Libraries
    Garate-Escamilla, Anna Karen
    Hajjam El Hassani, Amir
    Andres, Emmanuel
    [J]. PROCEEDINGS OF 2019 3RD INTERNATIONAL CONFERENCE ON CLOUD AND BIG DATA COMPUTING (ICCBDC 2019), 2019, : 78 - 83
  • [22] Big data Predictive Analytics for Apache Spark using Machine Learning
    Junaid, Muhammad
    Wagan, Shiraz Ali
    Qureshi, Nawab Muhammad Faseeh
    Nam, Choon Sung
    Shin, Dong Ryeol
    [J]. 2020 GLOBAL CONFERENCE ON WIRELESS AND OPTICAL TECHNOLOGIES (GCWOT), 2020,
  • [23] A Research Study on Running Machine Learning Algorithms on Big Data with Spark
    Kerestely, Arpad
    Baicoianu, Alexandra
    Bocu, Razvan
    [J]. KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, PT I, 2021, 12815 : 307 - 318
  • [24] SMBSP: A Self-Tuning Approach using Machine Learning to Improve Performance of Spark in Big Data Processing
    Rahman, Md. Armanur
    Hossen, J.
    Venkataseshaiah, C.
    [J]. PROCEEDINGS OF THE 2018 7TH INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATION ENGINEERING (ICCCE), 2018, : 274 - 279
  • [25] Social Media Data Processing Infrastructure by Using Apache Spark Big Data Platform: Twitter Data Analysis
    Podhoranyi, Michal
    Vojacek, Lukas
    [J]. 2019 4TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND INTERNET OF THINGS (CCIOT 2019), 2019, : 1 - 6
  • [26] Advanced Machine Learning Applications in Big Data Analytics
    Li, Taiyong
    Deng, Wu
    Wu, Jiang
    [J]. ELECTRONICS, 2023, 12 (13)
  • [28] Current applications of big data and machine learning in cardiology
    Cuocolo, Renato
    Perillo, Teresa
    De Rosa, Eliana
    Ugga, Lorenzo
    Petretta, Mario
    [J]. JOURNAL OF GERIATRIC CARDIOLOGY, 2019, 16 (08) : 601 - 607
  • [29] Spark Based Distributed Deep Learning Framework For Big Data Applications
    Khumoyun, Akhmedov
    Cui, Yun
    Hanku, Lee
    [J]. 2016 INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND COMMUNICATIONS TECHNOLOGIES (ICISCT), 2016,
  • [30] Computing infrastructure for big data processing
    Liu, Ling
    [J]. FRONTIERS OF COMPUTER SCIENCE, 2013, 7 (02) : 165 - 170