A Survey on Spark Ecosystem: Big Data Processing Infrastructure, Machine Learning, and Applications

Cited by: 30

Authors:

Tang, Shanjiang [1]

He, Bingsheng [2]

Yu, Ce [1]

Li, Yusen [3]

Li, Kun [1]

Affiliations:

[1] Tianjin Univ, Coll Intelligence & Comp, Tianjin 300072, Peoples R China

[2] Natl Univ Singapore, Sch Comp, Singapore 119077, Singapore

[3] Nankai Univ, Sch Comp, Tianjin 300071, Peoples R China

Source:

IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING | 2022, Vol. 34, No. 01

Funding:

National Natural Science Foundation of China;

Keywords:

Spark; shark; RDD; in-memory data processing; DATA PROVENANCE SUPPORT; DATA-MANAGEMENT;

DOI:

10.1109/TKDE.2020.2975652

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

With the explosive growth of big data in industry and academia, it is important to apply large-scale data processing systems to analyze it. Spark is arguably the state of the art among large-scale data computing systems today, owing to its generality, fault tolerance, high-performance in-memory data processing, and scalability. Spark adopts a flexible Resilient Distributed Dataset (RDD) programming model with a set of built-in transformation and action operators whose functions users can customize for their applications. It was originally positioned as a fast and general data processing system. Since its introduction, a large body of research has sought to make it faster and more general under various circumstances. In this survey, we thoroughly review optimization techniques that improve the generality and performance of Spark. We introduce the Spark programming model and computing system, discuss the pros and cons of Spark, and investigate and classify the solution techniques in the literature. We also introduce the data management and processing systems, machine learning algorithms, and applications supported by Spark. Finally, we discuss the open issues and challenges for large-scale in-memory data processing with Spark.

Citation

Pages: 71-91

Number of pages: 21

50 items in total

[21] Big data execution time based on Spark Machine Learning Libraries
Garate-Escamilla, Anna Karen
Hajjam El Hassani, Amir
Andres, Emmanuel
[J]. PROCEEDINGS OF 2019 3RD INTERNATIONAL CONFERENCE ON CLOUD AND BIG DATA COMPUTING (ICCBDC 2019), 2019, : 78 - 83
[22] Big data Predictive Analytics for Apache Spark using Machine Learning
Junaid, Muhammad
Wagan, Shiraz Ali
Qureshi, Nawab Muhammad Faseeh
Nam, Choon Sung
Shin, Dong Ryeol
[J]. 2020 GLOBAL CONFERENCE ON WIRELESS AND OPTICAL TECHNOLOGIES (GCWOT), 2020,
[23] A Research Study on Running Machine Learning Algorithms on Big Data with Spark
Kerestely, Arpad
Baicoianu, Alexandra
Bocu, Razvan
[J]. KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, PT I, 2021, 12815 : 307 - 318
[24] SMBSP: A Self-Tuning Approach using Machine Learning to Improve Performance of Spark in Big Data Processing
Rahman, Md. Armanur
Hossen, J.
Venkataseshaiah, C.
[J]. PROCEEDINGS OF THE 2018 7TH INTERNATIONAL CONFERENCE ON COMPUTER AND COMMUNICATION ENGINEERING (ICCCE), 2018, : 274 - 279
[25] Social Media Data Processing Infrastructure by Using Apache Spark Big Data Platform: Twitter Data Analysis
Podhoranyi, Michal
Vojacek, Lukas
[J]. 2019 4TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND INTERNET OF THINGS (CCIOT 2019), 2019, : 1 - 6
[26] Advanced Machine Learning Applications in Big Data Analytics
Li, Taiyong
Deng, Wu
Wu, Jiang
[J]. ELECTRONICS, 2023, 12 (13)
[27] Current applications of big data and machine learning in cardiology
Renato Cuocolo
Teresa Perillo
Eliana De Rosa
Lorenzo Ugga
Mario Petretta
[J]. Journal of Geriatric Cardiology, 2019, 16 (08) : 601 - 607
[28] Current applications of big data and machine learning in cardiology
Cuocolo, Renato
Perillo, Teresa
De Rosa, Eliana
Ugga, Lorenzo
Petretta, Mario
[J]. JOURNAL OF GERIATRIC CARDIOLOGY, 2019, 16 (08) : 601 - 607
[29] Spark Based Distributed Deep Learning Framework For Big Data Applications
Khumoyun, Akhmedov
Cui, Yun
Hanku, Lee
[J]. 2016 INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND COMMUNICATIONS TECHNOLOGIES (ICISCT), 2016,
[30] Computing infrastructure for big data processing
Liu, Ling
[J]. FRONTIERS OF COMPUTER SCIENCE, 2013, 7 (02) : 165 - 170
