Leveraging resource management for efficient performance of Apache Spark

Authors
Khadija Aziz
Dounia Zaidouni
Mostafa Bellafkih
Affiliation
[1] National Institute of Posts and Telecommunications, STRS Laboratory
Keywords
Resource management; Performance; Tuning; Distributed data processing; Machine learning algorithms; Apache Spark; MLlib
Abstract
Apache Spark is one of the most widely used open-source processing frameworks for big data; it processes large datasets in parallel across a large number of nodes. Applications built on this framework often rely on resource management systems such as YARN, which provide jobs with a specific amount of resources for their execution. In addition, a distributed file system such as HDFS stores the data to be analyzed by the framework. This design allows cluster resources to be shared effectively by running jobs on a single-node or multi-node cluster infrastructure. One challenging issue is therefore to achieve effective resource management of these large cluster infrastructures in order to run distributed data analytics in an economically viable way. In this study, we use the Machine Learning library (MLlib) of Spark to implement different machine learning algorithms, then we manage the resources (CPU, memory, and disk) in order to assess the performance of Apache Spark. We first present a review of various works that focus on resource management and data processing in Big Data platforms. We then perform a scalability analysis using Spark, analyzing the speedup and the processing time. We deduce that beyond a certain number of nodes in the cluster, adding further nodes no longer improves the speedup or the processing time. Next, we investigate the tuning of resource allocation in Spark. We show that allocating all available resources does not by itself yield better performance; performance depends on how the resource allocation is tuned. We propose new managed parameters and show that they give a better total processing time than the default parameters used by Spark. Finally, we study the persistence of Resilient Distributed Datasets (RDDs) in Spark using machine learning algorithms. We show that one storage level gives the best execution time among all tested storage levels.
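The scalability claim in the abstract rests on the usual definition of speedup, S(n) = T(1) / T(n), where T(n) is the processing time on n nodes. The sketch below is only an illustration of that calculation, not the authors' method: the function names, the 5% gain threshold, and the timings are all hypothetical and do not come from the paper.

```python
from typing import Dict, Optional


def speedup(times: Dict[int, float]) -> Dict[int, float]:
    """Return S(n) = T(1) / T(n) for every node count n in `times`."""
    baseline = times[1]  # processing time on a single node
    return {n: baseline / t for n, t in sorted(times.items())}


def saturation_point(times: Dict[int, float], min_gain: float = 0.05) -> Optional[int]:
    """Return the first node count after which the next measured cluster
    size improves speedup by less than `min_gain` (relative), or None if
    every step still yields at least that gain."""
    s = speedup(times)
    nodes = sorted(s)
    for prev, nxt in zip(nodes, nodes[1:]):
        relative_gain = (s[nxt] - s[prev]) / s[prev]
        if relative_gain < min_gain:
            return prev
    return None


# Hypothetical processing times in seconds, keyed by node count.
# These values are illustrative only and are NOT results from the paper.
timings = {1: 400.0, 2: 220.0, 4: 130.0, 8: 110.0, 16: 108.0}

if __name__ == "__main__":
    for n, s in speedup(timings).items():
        print(f"{n:>2} nodes: speedup {s:.2f}")
    print("saturation at:", saturation_point(timings))
```

With these illustrative timings, going from 8 to 16 nodes improves speedup by under 5%, so `saturation_point` returns 8. The threshold is a design choice: a stricter value reports saturation earlier, a looser one later.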
Citation
Aziz, Khadija; Zaidouni, Dounia; Bellafkih, Mostafa. Leveraging resource management for efficient performance of Apache Spark. Journal of Big Data, 2019, 6 (01).