A Survey on Spark Ecosystem: Big Data Processing Infrastructure, Machine Learning, and Applications

Cited by: 30
Authors
Tang, Shanjiang [1]
He, Bingsheng [2]
Yu, Ce [1]
Li, Yusen [3]
Li, Kun [1]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin 300072, Peoples R China
[2] Natl Univ Singapore, Sch Comp, Singapore 119077, Singapore
[3] Nankai Univ, Sch Comp, Tianjin 300071, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Spark; Shark; RDD; in-memory data processing; data provenance support; data management
DOI
10.1109/TKDE.2020.2975652
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the explosive growth of big data in industry and academia, it is important to apply large-scale data processing systems to analyze it. Spark is arguably the state of the art among large-scale data computing systems today, owing to its generality, fault tolerance, high-performance in-memory data processing, and scalability. Spark adopts a flexible Resilient Distributed Dataset (RDD) programming model with a set of built-in transformation and action operators whose functions users can customize for their applications. It was originally positioned as a fast and general data processing system. Since its introduction, a large body of research has sought to make it faster and more general under various circumstances. In this survey, we thoroughly review the optimization techniques that improve the generality and performance of Spark. We introduce the Spark programming model and computing system, discuss the pros and cons of Spark, and investigate and classify the solution techniques proposed in the literature. We also introduce the data management and processing systems, machine learning algorithms, and applications supported by Spark. Finally, we discuss open issues and challenges for large-scale in-memory data processing with Spark.
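The split between transformation and action operators mentioned in the abstract can be illustrated with a small sketch. This is a toy, single-machine model in plain Python, not Spark's API: the class `ToyRDD` and its methods are hypothetical names chosen to mirror the idea that transformations only record a user-supplied function, while actions trigger the actual computation.

```python
from functools import reduce
from typing import Any, Callable, Iterable, List


class ToyRDD:
    """Toy stand-in for an RDD (hypothetical; not Spark's implementation)."""

    def __init__(self, source: Callable[[], Iterable[Any]]):
        # Store a recipe that produces the data, not the data itself.
        self._source = source

    @classmethod
    def parallelize(cls, items: Iterable[Any]) -> "ToyRDD":
        data = list(items)
        return cls(lambda: iter(data))

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, f: Callable[[Any], Any]) -> "ToyRDD":
        return ToyRDD(lambda: (f(x) for x in self._source()))

    def filter(self, pred: Callable[[Any], bool]) -> "ToyRDD":
        return ToyRDD(lambda: (x for x in self._source() if pred(x)))

    # Actions: run the recorded pipeline and return a result.
    def collect(self) -> List[Any]:
        return list(self._source())

    def reduce(self, f: Callable[[Any, Any], Any]) -> Any:
        return reduce(f, self._source())


calls: List[int] = []


def square(x: int) -> int:
    calls.append(x)  # record each invocation to make laziness visible
    return x * x


numbers = ToyRDD.parallelize([1, 2, 3, 4, 5])
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
print(calls)                                      # [] : nothing has run yet
print(even_squares.collect())                     # [4, 16]
print(even_squares.reduce(lambda a, b: a + b))    # 20
```

In this toy model each action re-runs the recorded recipe from the source, which is similar in spirit to how an uncached RDD is recomputed for each action. Partitioning, distribution across machines, and caching are deliberately left out.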
Pages: 71-91
Page count: 21