A Survey on Spark Ecosystem: Big Data Processing Infrastructure, Machine Learning, and Applications

Cited by: 30
Authors
Tang, Shanjiang [1]
He, Bingsheng [2]
Yu, Ce [1]
Li, Yusen [3]
Li, Kun [1]
Affiliations
[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin 300072, Peoples R China
[2] Natl Univ Singapore, Sch Comp, Singapore 119077, Singapore
[3] Nankai Univ, Sch Comp, Tianjin 300071, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Spark; Shark; RDD; in-memory data processing; data provenance support; data management
DOI
10.1109/TKDE.2020.2975652
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
With the explosive growth of big data in industry and academia, large-scale data processing systems have become essential for analyzing it. Spark is arguably the state of the art among large-scale data computing systems today, owing to its generality, fault tolerance, high-performance in-memory data processing, and scalability. Spark adopts a flexible Resilient Distributed Dataset (RDD) programming model with a set of transformation and action operators whose operating functions users can customize for their applications. It was originally positioned as a fast and general data processing system. Since its introduction, a large body of research has sought to make it more efficient (faster) and more general under various circumstances. In this survey, we thoroughly review the optimization techniques proposed to improve Spark's generality and performance. We introduce the Spark programming model and computing system, discuss the pros and cons of Spark, and investigate and classify the solution techniques in the literature. We also introduce the data management and processing systems, machine learning algorithms, and applications supported by Spark. Finally, we discuss the open issues and challenges for large-scale in-memory data processing with Spark.
Pages: 71-91
Number of pages: 21