Efficient Performance Prediction for Apache Spark

Cited by: 25
Authors
Cheng, Guoli [1]
Ying, Shi [1]
Wang, Bingming [1]
Li, Yuhang [1]
Affiliations
[1] Wuhan Univ, Sch Comp Sci, Bayi Rd 299, Wuhan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Performance prediction; Spark; System configuration; Adaboost; Projective sampling;
DOI
10.1016/j.jpdc.2020.10.010
Chinese Library Classification number
TP301 [Theory and methods]
Discipline classification code
081202
Abstract
Spark is a distributed big data processing framework that followed Hadoop and is more efficient. It provides users with more than 180 adjustable configuration parameters, and automatically choosing the optimal configuration so that a Spark application runs effectively is challenging. The key to addressing this challenge is the ability to predict the performance of Spark applications under different configurations. This paper proposes a new approach based on Adaboost that can efficiently and accurately predict the performance of a given application with a given Spark configuration. In our approach, Adaboost is used to build a set of stage-level performance models for Spark. To minimize the modeling overhead, we use classic projective sampling, a data mining technique that allows us to collect as few training samples as possible while meeting the accuracy requirements. We evaluate the proposed approach on six typical Spark benchmarks with five input datasets. The experimental results show that our approach achieves lower prediction error and lower cost than the previously proposed approach. (C) 2020 Elsevier Inc. All rights reserved.
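The projective sampling idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the learning curve follows a single power law, error = a * n^b, whereas projective sampling in general may compare several candidate curve shapes. The function names and the measurements in the usage lines are hypothetical and serve only as an illustration.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of error = a * n**b in log-log space; returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def project_sample_size(sizes, errors, target_error):
    """Project the smallest training-set size expected to reach target_error."""
    a, b = fit_power_law(sizes, errors)
    if b >= 0:
        # A non-decreasing curve gives no basis for projecting a finite size.
        raise ValueError("fitted error does not decrease with sample size")
    # Solve a * n**b = target_error for n and round up to a whole sample.
    return math.ceil((target_error / a) ** (1.0 / b))


if __name__ == "__main__":
    # Hypothetical (training-set size, prediction error) pairs from a few small runs.
    sizes = [10, 20, 40, 80]
    errors = [0.60, 0.45, 0.33, 0.24]
    print(project_sample_size(sizes, errors, target_error=0.10))
```

The point of the sketch is that only a handful of small training runs are needed to estimate the curve, after which the required number of samples is projected rather than found by repeatedly enlarging the training set.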
Pages: 40-51
Number of pages: 12