Efficient Performance Prediction for Apache Spark

Cited by: 25

Authors:

Cheng, Guoli [1]

Ying, Shi [1]

Wang, Bingming [1]

Li, Yuhang [1]

Affiliations:

[1] Wuhan Univ, Sch Comp Sci, Bayi Rd 299, Wuhan, Peoples R China

Source:

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING | 2021 / Vol. 149

Funding:

National Natural Science Foundation of China;

Keywords:

Performance prediction; Spark; System configuration; Adaboost; Projective sampling;

DOI:

10.1016/j.jpdc.2020.10.010

Chinese Library Classification (CLC) number:

TP301 [Theory and methods];

Subject classification code:

081202;

Abstract:

Spark is a distributed big data processing framework that followed Hadoop and is more efficient. It provides users with more than 180 adjustable configuration parameters, and automatically choosing the optimal configuration so that a Spark application runs efficiently is challenging. The key to addressing this challenge is the ability to predict the performance of Spark applications under different configurations. This paper proposes a new approach based on Adaboost, which can efficiently and accurately predict the performance of a given application with a given Spark configuration. In our approach, Adaboost is used to build a set of performance models at the stage level for Spark. To minimize the modeling overhead, we use classic projective sampling, a data mining technique that allows us to collect as few training samples as possible while meeting the accuracy requirements. We evaluate the proposed approach on six typical Spark benchmarks with five input datasets. The experimental results show that our approach achieves lower prediction error and lower cost than the previously proposed approach. (C) 2020 Elsevier Inc. All rights reserved.
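
The projective sampling idea mentioned in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that prediction error follows a power-law learning curve, error ≈ a · n^(-b), in the number of training samples n. The abstract does not state which curve family or cost model the authors use, so the function names `fit_power_law` and `projected_sample_size` and the pilot numbers below are hypothetical, not taken from the paper.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of error ~= a * n**(-b) in log-log space.

    Assumption: a power-law learning curve; other curve families
    (logarithmic, exponential) are equally plausible candidates.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # log(error) = intercept + slope * log(n), so a = exp(intercept), b = -slope
    return math.exp(intercept), -slope


def projected_sample_size(a, b, target_error):
    """Smallest n for which the fitted curve predicts error <= target_error."""
    if b <= 0:
        raise ValueError("fitted curve does not decrease with sample size")
    return math.ceil((a / target_error) ** (1.0 / b))


# Hypothetical pilot measurements (made up for illustration):
# model error observed after training on a few small sample sets.
pilot_sizes = [10, 20, 40, 80]
pilot_errors = [0.60, 0.45, 0.33, 0.24]

a, b = fit_power_law(pilot_sizes, pilot_errors)
needed = projected_sample_size(a, b, target_error=0.10)
print(f"fitted a={a:.3f}, b={b:.3f}, projected samples needed: {needed}")
```

The point of the sketch is the workflow: train on a handful of small samples, fit a curve to the observed errors, and extrapolate to the sample size expected to meet the accuracy requirement, rather than collecting a large training set up front.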

Pages: 40-51

Page count: 12

