Leveraging resource management for efficient performance of Apache Spark

Authors
Khadija Aziz
Dounia Zaidouni
Mostafa Bellafkih
Affiliation
[1] National Institute of Posts and Telecommunications, STRS Laboratory
Keywords
Resource management; Performance; Tuning; Distributed data processing; Machine learning algorithms; Apache Spark; MLlib
DOI
Not available
Abstract
Apache Spark is one of the most widely used open source processing frameworks for big data. It processes large datasets in parallel across a large number of nodes. Applications built on this framework often rely on resource management systems such as YARN, which allocate a specific amount of resources to each job for its execution. In addition, a distributed file system such as HDFS stores the data to be analyzed by the framework. This design allows cluster resources to be shared effectively by running jobs on a single-node or multi-node cluster infrastructure. One challenging issue is therefore to manage the resources of these large cluster infrastructures effectively, so that distributed data analytics can run in an economically viable way. In this study, we use Spark's Machine Learning library (MLlib) to implement different machine learning algorithms, then we manage the resources (CPU, memory, and disk) in order to assess the performance of Apache Spark. We first present a review of various works that focus on resource management and data processing in Big Data platforms. We then perform a scalability analysis using Spark, analyzing the speedup and the processing time. We deduce that, beyond a certain number of nodes in the cluster, adding further nodes no longer improves the speedup or the processing time. Next, we investigate the tuning of resource allocation in Spark. We show that better performance does not come simply from allocating all available resources; it depends on how the resource allocation is tuned. We propose new managed parameters and show that they give a better total processing time than the default parameters used by Spark. Finally, we study the persistence of Resilient Distributed Datasets (RDDs) in Spark using machine learning algorithms. We show that one storage level gives the best execution time among all tested storage levels.
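The scalability analysis mentioned in the abstract can be illustrated with a small sketch. It assumes the usual definition of speedup (processing time on the smallest cluster divided by processing time on n nodes). The function names, the 5% gain threshold, and the timings are hypothetical placeholders chosen for illustration; they are not the paper's method or measurements.

```python
from typing import Dict


def speedup_table(times_by_nodes: Dict[int, float]) -> Dict[int, Dict[str, float]]:
    """Compute speedup and parallel efficiency relative to the smallest cluster."""
    base_nodes = min(times_by_nodes)
    base_time = times_by_nodes[base_nodes]
    table = {}
    for nodes in sorted(times_by_nodes):
        speedup = base_time / times_by_nodes[nodes]
        # Efficiency: achieved speedup divided by the ideal (linear) speedup.
        efficiency = speedup / (nodes / base_nodes)
        table[nodes] = {"speedup": speedup, "efficiency": efficiency}
    return table


def saturation_point(times_by_nodes: Dict[int, float], min_gain: float = 0.05) -> int:
    """Return the node count after which the next step cuts time by less than min_gain."""
    nodes = sorted(times_by_nodes)
    for prev, nxt in zip(nodes, nodes[1:]):
        gain = (times_by_nodes[prev] - times_by_nodes[nxt]) / times_by_nodes[prev]
        if gain < min_gain:
            return prev
    return nodes[-1]


# Hypothetical processing times in seconds, keyed by number of nodes.
example_times = {1: 400.0, 2: 220.0, 4: 130.0, 8: 125.0}

if __name__ == "__main__":
    for nodes, row in speedup_table(example_times).items():
        print(nodes, round(row["speedup"], 2), round(row["efficiency"], 2))
    print("saturation at", saturation_point(example_times), "nodes")
```

With these placeholder timings, going from 4 to 8 nodes reduces the processing time by less than 5%, which is the kind of plateau the abstract describes.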