Leveraging resource management for efficient performance of Apache Spark

Cited by: 0

Authors:

Khadija Aziz

Dounia Zaidouni

Mostafa Bellafkih

Affiliations:

[1] National Institute of Posts and Telecommunications, STRS Laboratory

Source:

Journal of Big Data, 2019, 6 (01)

Keywords:

Resource management; Performance; Tuning; Distributed data processing; Machine learning algorithms; Apache Spark; MLlib

DOI:

Not available

Abstract:

Apache Spark is one of the most widely used open-source processing frameworks for big data; it processes large datasets in parallel across a large number of nodes. Spark applications often rely on a resource management system such as YARN, which provides jobs with a specific amount of resources for their execution. In addition, a distributed file system such as HDFS stores the data to be analyzed by the framework. This design allows cluster resources to be shared effectively, with jobs running on a single-node or multi-node cluster infrastructure. One challenging issue is therefore to manage the resources of these large cluster infrastructures effectively, so that distributed data analytics can run in an economically viable way. In this study, we use Spark's Machine Learning library (MLlib) to implement different machine learning algorithms, then we manage the resources (CPU, memory, and disk) in order to assess the performance of Apache Spark. We first present a review of various works that focus on resource management and data processing in Big Data platforms. Furthermore, we perform a scalability analysis using Spark, analyzing the speedup and the processing time. We deduce that beyond a certain number of nodes in the cluster, adding further nodes no longer improves the speedup or the processing time. Then, we investigate the tuning of resource allocation in Spark. We show that allocating all the available resources does not by itself give better performance; the result depends on how the resource allocation is tuned. We propose new managed parameters and show that they give a better total processing time than the default parameters used by Spark. Finally, we study the persistence of Resilient Distributed Datasets (RDDs) in Spark using machine learning algorithms. We show that one storage level gives the best execution time among all tested storage levels.
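The scalability analysis mentioned in the abstract can be illustrated with a small sketch. The abstract does not state how speedup is computed, so the code below assumes the conventional definition (processing time on the smallest cluster divided by processing time on n nodes). The function names and all timings are hypothetical and are not taken from the paper; the numbers are chosen only to show a curve that flattens as nodes are added.

```python
from typing import Dict, List, Tuple


def speedup(t_baseline: float, t_n: float) -> float:
    """Speedup relative to the baseline run: S = T_baseline / T_n (assumed definition)."""
    if t_baseline <= 0 or t_n <= 0:
        raise ValueError("processing times must be positive")
    return t_baseline / t_n


def scaling_table(times_by_nodes: Dict[int, float]) -> List[Tuple[int, float, float]]:
    """Return (nodes, speedup, efficiency) rows, using the smallest cluster as baseline."""
    base_nodes = min(times_by_nodes)
    t_base = times_by_nodes[base_nodes]
    rows = []
    for nodes in sorted(times_by_nodes):
        s = speedup(t_base, times_by_nodes[nodes])
        # Efficiency: achieved speedup divided by the ideal (linear) speedup.
        efficiency = s / (nodes / base_nodes)
        rows.append((nodes, s, efficiency))
    return rows


# Hypothetical processing times in seconds, keyed by node count (NOT from the paper).
HYPOTHETICAL_TIMES = {1: 400.0, 2: 200.0, 4: 125.0, 8: 100.0, 16: 100.0}

if __name__ == "__main__":
    for nodes, s, eff in scaling_table(HYPOTHETICAL_TIMES):
        print(f"{nodes:>2} nodes: speedup {s:.2f}, efficiency {eff:.2f}")
```

With these made-up timings the speedup stops growing between 8 and 16 nodes while the efficiency keeps falling, which is the kind of plateau the abstract describes.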
