Leveraging resource management for efficient performance of Apache Spark

Cited by: 0

Authors:

Khadija Aziz

Dounia Zaidouni

Mostafa Bellafkih

Affiliation:

[1] National Institute of Posts and Telecommunications, STRS Laboratory

Source:

Journal of Big Data, vol. 6

Keywords:

Resource management; Performance; Tuning; Distributed data processing; Machine learning algorithms; Apache Spark; MLlib

DOI:

Not available

Abstract:

Apache Spark is one of the most widely used open-source processing frameworks for big data; it processes large datasets in parallel across a large number of nodes. Applications built on this framework often rely on resource management systems such as YARN, which provide jobs with a specific amount of resources for their execution. In addition, a distributed file system such as HDFS stores the data to be analyzed by the framework. This design allows cluster resources to be shared effectively by running jobs on a single-node or multi-node cluster infrastructure. One challenging issue is therefore to achieve effective resource management of these large cluster infrastructures so that distributed data analytics can run in an economically viable way. In this study, we use Spark's machine learning library (MLlib) to implement different machine learning algorithms, then manage the resources (CPU, memory, and disk) in order to assess the performance of Apache Spark. We first present a review of various works that focus on resource management and data processing in big data platforms. We then perform a scalability analysis using Spark, analyzing the speedup and the processing time. We deduce that beyond a certain number of nodes in the cluster, adding further nodes no longer improves the speedup or the processing time. Next, we investigate the tuning of resource allocation in Spark. We show that allocating all available resources does not by itself yield better performance; the result depends on how the resource allocation is tuned. We propose new managed parameters and show that they give a better total processing time than the default parameters used by Spark. Finally, we study the persistence of Resilient Distributed Datasets (RDDs) in Spark using machine learning algorithms. We show that one storage level gives the best execution time among all tested storage levels.
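The scalability analysis mentioned in the abstract rests on speedup and processing time. As a minimal sketch, assuming the conventional definition of speedup S(n) = T(1) / T(n) and parallel efficiency S(n) / n (the abstract does not state which definition the authors use), the quantities could be computed as below. The function names and the sample timings are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def speedup(t_single: float, t_parallel: float) -> float:
    """Conventional speedup S(n) = T(1) / T(n); assumed, not quoted from the paper."""
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("processing times must be positive")
    return t_single / t_parallel


def scaling_table(times_by_nodes: Dict[int, float]) -> List[Tuple[int, float, float]]:
    """Return (nodes, speedup, efficiency) rows sorted by node count.

    times_by_nodes maps a node count to a measured processing time in seconds.
    The entry for 1 node is used as the baseline T(1).
    """
    if 1 not in times_by_nodes:
        raise ValueError("a single-node baseline (key 1) is required")
    baseline = times_by_nodes[1]
    rows = []
    for nodes in sorted(times_by_nodes):
        s = speedup(baseline, times_by_nodes[nodes])
        rows.append((nodes, s, s / nodes))  # efficiency = speedup per node
    return rows


if __name__ == "__main__":
    # Hypothetical timings, for illustration only (not results from the paper).
    example = {1: 100.0, 2: 50.0, 4: 40.0}
    for nodes, s, eff in scaling_table(example):
        print(f"nodes={nodes} speedup={s:.2f} efficiency={eff:.3f}")
```

Under this definition, a flattening speedup together with falling efficiency as nodes are added is the pattern the abstract describes, where additional nodes no longer improve the processing time.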
