Efficient Performance Prediction for Apache Spark

Cited by: 25
Authors
Cheng, Guoli [1]
Ying, Shi [1]
Wang, Bingming [1]
Li, Yuhang [1]
Affiliations
[1] Wuhan Univ, Sch Comp Sci, Bayi Rd 299, Wuhan, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Performance prediction; Spark; System configuration; Adaboost; Projective sampling;
DOI
10.1016/j.jpdc.2020.10.010
Chinese Library Classification number
TP301 [Theory, Methods]
Subject classification code
081202
Abstract
Spark is a distributed big data processing framework that is more efficient than its predecessor, Hadoop. It exposes more than 180 adjustable configuration parameters, and automatically choosing the optimal configuration so that a Spark application runs efficiently is challenging. The key to addressing this challenge is the ability to predict the performance of Spark applications under different configurations. This paper proposes a new approach based on Adaboost that can efficiently and accurately predict the performance of a given application with a given Spark configuration. In our approach, Adaboost is used to build a set of stage-level performance models for Spark. To minimize the modeling overhead, we use classic projective sampling, a data mining technique that allows us to collect as few training samples as possible while meeting the accuracy requirements. We evaluate the proposed approach on six typical Spark benchmarks with five input datasets. The experimental results show that our approach achieves lower prediction error and lower cost than the previously proposed approach. (C) 2020 Elsevier Inc. All rights reserved.
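The abstract does not spell out how projective sampling is applied, so the following is only a minimal sketch of the general idea, under stated assumptions: measure model error at a few small training-set sizes, fit a learning curve to those points, and extrapolate the smallest sample size expected to meet an accuracy target. The power-law curve form, the function names `fit_power_law` and `project_sample_size`, and the numbers are all hypothetical and are not taken from the paper.

```python
import math

def fit_power_law(sizes, errors):
    """Least-squares fit of error = a * n**b, done as a line fit in log-log space."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope of the log-log line (negative if error shrinks)
    a = math.exp(mean_y - b * mean_x)  # intercept mapped back from log space
    return a, b

def project_sample_size(a, b, target_error):
    """Smallest n for which the fitted curve a * n**b drops to target_error."""
    if b >= 0:
        raise ValueError("fitted learning curve is not decreasing")
    return math.ceil((target_error / a) ** (1.0 / b))

# Illustrative (made-up) measurements: prediction error observed after
# training on 10, 20, 40, and 80 sampled configurations.
sizes = [10, 20, 40, 80]
errors = [0.40, 0.29, 0.21, 0.15]

a, b = fit_power_law(sizes, errors)
needed = project_sample_size(a, b, target_error=0.10)
print(f"fitted curve: error ~ {a:.2f} * n^{b:.2f}; projected samples needed: {needed}")
```

In practice the error at each sample size would come from validating a trained model (for example, an Adaboost regressor per stage), and other curve families could be compared before projecting; the sketch only shows the extrapolation step.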
Pages: 40-51
Number of pages: 12