LPW: an efficient data-aware cache replacement strategy for Apache Spark

Cited by: 1
Authors
Li, Hui [1 ,2 ]
Ji, Shuping [1 ]
Zhong, Hua [1 ]
Wang, Wei [1 ,2 ,3 ,4 ]
Xu, Lijie [1 ,2 ,3 ,4 ]
Tang, Zhen [1 ]
Wei, Jun [1 ,2 ]
Huang, Tao [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Software, State Key Lab Comp Sci, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] Nanjing Inst Software Technol, Nanjing 210000, Peoples R China
[4] Univ Chinese Acad Sci, Nanjing 210008, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Spark; memory; cache replacement; least partition weight; data-aware;
DOI
10.1007/s11432-021-3406-5
CLC classification
TP [automation technology, computer technology]
Discipline code
0812
Abstract
Caching is one of the most important techniques in Spark, the popular distributed big data processing framework. Because this parallel computing framework is designed to support various applications based on in-memory computing, it cannot cache every intermediate result due to memory size limitations. The arbitrariness of cache application programming interface (API) usage, the diversity of application characteristics, and the variability of memory resources make it challenging to achieve high system execution performance. An inefficient cache replacement strategy may cause performance problems such as long application execution time, low memory utilization, high replacement frequency, and even program execution failure due to out-of-memory errors. The cache replacement strategy currently adopted by Spark is least recently used (LRU). Although LRU is a classical and widely used algorithm, it does not consider the environment or the workload, so it cannot achieve good performance in many scenarios. In this paper, we propose a novel cache replacement algorithm, least partition weight (LPW). LPW comprehensively considers the different factors affecting system performance, such as partition size, computational cost, and reference count. We implemented the LPW algorithm in Spark and compared it against LRU as well as other state-of-the-art mechanisms. Our detailed experiments indicate that LPW clearly outperforms its counterparts and can reduce execution time by up to 75% under typical workloads. Furthermore, the decreased eviction frequency shows that the LPW algorithm generates more reasonable predictions.
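The abstract describes LPW as weighing each cached partition by factors such as partition size, computational cost, and reference count, and evicting the partition with the least weight. The sketch below illustrates that idea in Python; it is a hypothetical illustration, not the paper's implementation. In particular, the weight formula `cost * refs / size`, the `Partition` fields, and the `LPWCache` class are all assumptions made for the example — the paper's exact weighting function is not reproduced here.

```python
# Hypothetical sketch of a least-partition-weight (LPW) style eviction
# policy. The weight formula (cost * refs / size) is an assumption; the
# paper's exact combination of partition size, computational cost, and
# reference count may differ.
from dataclasses import dataclass


@dataclass
class Partition:
    pid: str
    size: int    # bytes occupied in the cache
    cost: float  # estimated recomputation cost (e.g., seconds)
    refs: int    # remaining reference count in the job DAG

    @property
    def weight(self) -> float:
        # Large, cheap-to-recompute, rarely reused partitions get a low
        # weight and are evicted first.
        return self.cost * self.refs / self.size


class LPWCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.used = 0
        self.parts: dict[str, Partition] = {}

    def put(self, p: Partition) -> list[str]:
        """Cache p, evicting least-weight partitions until it fits.

        Returns the list of evicted partition ids."""
        evicted = []
        while self.used + p.size > self.capacity and self.parts:
            victim = min(self.parts.values(), key=lambda q: q.weight)
            if victim.weight >= p.weight:
                # The incoming partition is worth less than every cached
                # one; do not cache it at all.
                return evicted
            del self.parts[victim.pid]
            self.used -= victim.size
            evicted.append(victim.pid)
        if self.used + p.size <= self.capacity:
            self.parts[p.pid] = p
            self.used += p.size
        return evicted
```

Under this sketch, a small partition that is expensive to recompute and referenced many times survives, while a large, cheap, once-used partition is the first eviction candidate — which is the data-aware behavior the abstract contrasts with LRU's recency-only ordering.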
Pages: 20
Related papers
50 items total
  • [1] LPW: an efficient data-aware cache replacement strategy for Apache Spark
    Hui Li
    Shuping Ji
    Hua Zhong
    Wei Wang
    Lijie Xu
    Zhen Tang
    Jun Wei
    Tao Huang
    [J]. Science China Information Sciences, 2023, 66 (01) : 77 - 96
  • [3] A Memory-Aware Spark Cache Replacement Strategy
    Zhang, Jingyu
    Zhang, Ruihan
    Alfarraj, Osama
    Tolba, Amr
    Kim, Gwang-Jun
    [J]. JOURNAL OF INTERNET TECHNOLOGY, 2022, 23 (06): : 1185 - 1190
  • [4] Data-Aware Cache Management for Graph Analytics
    Sharma, Neelam
    Venkitaraman, Varun
    Newton
    Kumar, Vikash
    Singhania, Shubham
    Jha, Chandan Kumar
    [J]. PROCEEDINGS OF THE 2022 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE 2022), 2022, : 843 - 848
  • [5] Effective data management strategy and RDD weight cache replacement strategy in Spark
    Jiang, Kun
    Du, Shaofeng
    Zhao, Fu
    Huang, Yong
    Li, Chunlin
    Luo, Youlong
    [J]. COMPUTER COMMUNICATIONS, 2022, 194 : 66 - 85
  • [6] Intermediate data placement and cache replacement strategy under Spark platform
    Li, Chunlin
    Zhang, Yong
    Luo, Youlong
    [J]. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2022, 163 : 114 - 135
  • [7] Efficient Incremental Data Analytics with Apache Spark
    Gholamian, Sina
    Golab, Wojciech
    Ward, Paul A. S.
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2017, : 2859 - 2868
  • [8] A data-aware scheduling strategy for workflow execution in clouds
    Marozzo, Fabrizio
    Rodrigo Duro, Francisco
    Garcia Blas, Javier
    Carretero, Jesus
    Talia, Domenico
    Trunfio, Paolo
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2017, 29 (24):
  • [9] An efficient cache replacement strategy for the hybrid cache consistency approach
    Zeitunlian, Aline
    Haraty, Ramzi A.
    [J]. World Academy of Science, Engineering and Technology, 2010, 63 : 268 - 273
  • [10] A Data-aware Learned Index Scheme for Efficient Writes
    Liu, Li
    Li, Chunhua
    Zhang, Zhou
    Liu, Yuhan
    Zhou, Ke
    Zhang, Ji
    [J]. 51ST INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2022, 2022,