Efficient Performance Prediction for Apache Spark

Cited by: 25

Authors:

Cheng, Guoli [1]

Ying, Shi [1]

Wang, Bingming [1]

Li, Yuhang [1]

Affiliations:

[1] Wuhan Univ, Sch Comp Sci, Bayi Rd 299, Wuhan, Peoples R China

Source:

JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING | 2021 / Vol. 149

Funding:

National Natural Science Foundation of China;

Keywords:

Performance prediction; Spark; System configuration; Adaboost; Projective sampling;

DOI:

10.1016/j.jpdc.2020.10.010

Chinese Library Classification (CLC) number:

TP301 [Theory, Methods];

Discipline classification code:

081202;

Abstract:

Spark is a distributed big data processing framework that followed Hadoop and is more efficient. It exposes more than 180 adjustable configuration parameters, and automatically choosing the optimal configuration so that a Spark application runs effectively is challenging. The key to addressing this challenge is the ability to predict the performance of Spark applications under different configurations. This paper proposes a new approach based on Adaboost that can efficiently and accurately predict the performance of a given application under a given Spark configuration. In our approach, Adaboost is used to build a set of stage-level performance models for Spark. To minimize the modeling overhead, we use classic projective sampling, a data mining technique that allows us to collect as few training samples as possible while still meeting the accuracy requirements. We evaluate the proposed approach on six typical Spark benchmarks with five input datasets. The experimental results show that our approach achieves lower prediction error and lower cost than the previously proposed approach. (C) 2020 Elsevier Inc. All rights reserved.
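
The sampling idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say which learning-curve form the paper projects with, so a power law, error = a * n^(-b), is assumed here, and the function names and pilot measurements are hypothetical. The sketch fits the curve to a few small pilot training sets and projects how many samples would be needed to reach a target prediction error.

```python
import math


def fit_power_law(sizes, errors):
    """Least-squares fit of error = a * n**(-b) in log-log space.

    The power-law form is an assumption made for this sketch.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


def projected_sample_size(a, b, target_error):
    """Smallest n for which the fitted curve predicts error <= target_error."""
    if b <= 0:
        raise ValueError("fitted learning curve is not decreasing")
    return math.ceil((a / target_error) ** (1.0 / b))


# Hypothetical pilot data: training-set sizes and the prediction error
# measured after training a model on each of them.
pilot_sizes = [10, 20, 40, 80]
pilot_errors = [0.40, 0.29, 0.21, 0.15]

a, b = fit_power_law(pilot_sizes, pilot_errors)
n_needed = projected_sample_size(a, b, target_error=0.10)
print(f"fitted a={a:.3f}, b={b:.3f}, projected samples needed: {n_needed}")
```

In a setting like the one described, each pilot point would come from running the application under sampled configurations and training a model on the results, so stopping at the projected size avoids collecting more runs than the accuracy target requires.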


Pages: 40-51

Page count: 12
