LPW: an efficient data-aware cache replacement strategy for Apache Spark

Cited by: 1
Authors
Li, Hui [1 ,2 ]
Ji, Shuping [1 ]
Zhong, Hua [1 ]
Wang, Wei [1 ,2 ,3 ,4 ]
Xu, Lijie [1 ,2 ,3 ,4 ]
Tang, Zhen [1 ]
Wei, Jun [1 ,2 ]
Huang, Tao [1 ]
Affiliations
[1] Chinese Acad Sci, Inst Software, State Key Lab Comp Sci, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[3] Nanjing Inst Software Technol, Nanjing 210000, Peoples R China
[4] Univ Chinese Acad Sci, Nanjing 210008, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Spark; memory; cache replacement; least partition weight; data-aware;
DOI
10.1007/s11432-021-3406-5
CLC classification
TP [automation technology, computer technology]
Discipline code
0812
Abstract
Caching is one of the most important techniques in Spark, the popular distributed big data processing framework. Because this parallel computing framework is designed to support various applications based on in-memory computing, it cannot cache every intermediate result due to memory size limitations. The arbitrariness of cache application programming interface (API) usage, the diversity of application characteristics, and the variability of memory resources make it challenging to achieve high system execution performance. An inefficient cache replacement strategy may cause performance problems such as long application execution time, low memory utilization, high replacement frequency, and even program execution failure due to out-of-memory errors. The cache replacement strategy currently adopted by Spark is least recently used (LRU). Although LRU is a classical and widely used algorithm, it does not consider the environment or the workload, so it cannot achieve good performance in many scenarios. In this paper, we propose a novel cache replacement algorithm, least partition weight (LPW). LPW comprehensively considers the different factors affecting system performance, such as partition size, computational cost, and reference count. We implemented the LPW algorithm in Spark and compared it against LRU as well as other state-of-the-art mechanisms. Our detailed experiments indicate that LPW clearly outperforms its counterparts and can reduce execution time by up to 75% under typical workloads. Furthermore, the decreased eviction frequency shows that the LPW algorithm generates more reasonable predictions.
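The abstract describes LPW as weighing each cached partition by factors such as partition size, computational cost, and reference count, and evicting the partition with the least weight. The sketch below illustrates that idea in Python; it is a hypothetical illustration, not the paper's implementation. In particular, the weight formula `cost * refs / size`, the `Partition` fields, and the `LPWCache` class are all assumptions made for the example — the paper's exact weighting function is not reproduced here.

```python
# Hypothetical sketch of a least-partition-weight (LPW) style eviction
# policy. The weight formula (cost * refs / size) is an assumption; the
# paper's exact combination of partition size, computational cost, and
# reference count may differ.
from dataclasses import dataclass


@dataclass
class Partition:
    pid: str
    size: int    # bytes occupied in the cache
    cost: float  # estimated recomputation cost (e.g., seconds)
    refs: int    # remaining reference count in the job DAG

    @property
    def weight(self) -> float:
        # Large, cheap-to-recompute, rarely reused partitions get a low
        # weight and are evicted first.
        return self.cost * self.refs / self.size


class LPWCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.used = 0
        self.parts: dict[str, Partition] = {}

    def put(self, p: Partition) -> list[str]:
        """Cache p, evicting least-weight partitions until it fits.

        Returns the list of evicted partition ids."""
        evicted = []
        while self.used + p.size > self.capacity and self.parts:
            victim = min(self.parts.values(), key=lambda q: q.weight)
            if victim.weight >= p.weight:
                # The incoming partition is worth less than every cached
                # one; do not cache it at all.
                return evicted
            del self.parts[victim.pid]
            self.used -= victim.size
            evicted.append(victim.pid)
        if self.used + p.size <= self.capacity:
            self.parts[p.pid] = p
            self.used += p.size
        return evicted
```

Under this sketch, a small partition that is expensive to recompute and referenced many times survives, while a large, cheap, once-used partition is the first eviction candidate — which is the data-aware behavior the abstract contrasts with LRU's recency-only ordering.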
Pages: 20
Related papers
50 items total
  • [1] LPW: an efficient data-aware cache replacement strategy for Apache Spark
    Hui Li
    Shuping Ji
    Hua Zhong
    Wei Wang
    Lijie Xu
    Zhen Tang
    Jun Wei
    Tao Huang
    [J]. Science China Information Sciences, 2023, 66 (01) : 77 - 96
  • [3] A Memory-Aware Spark Cache Replacement Strategy
    Zhang, Jingyu
    Zhang, Ruihan
    Alfarraj, Osama
    Tolba, Amr
    Kim, Gwang-Jun
    [J]. JOURNAL OF INTERNET TECHNOLOGY, 2022, 23 (06): : 1185 - 1190
  • [4] Data-Aware Cache Management for Graph Analytics
    Sharma, Neelam
    Venkitaraman, Varun
    Newton
    Kumar, Vikash
    Singhania, Shubham
    Jha, Chandan Kumar
    [J]. PROCEEDINGS OF THE 2022 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION (DATE 2022), 2022, : 843 - 848
  • [5] Effective data management strategy and RDD weight cache replacement strategy in Spark
    Jiang, Kun
    Du, Shaofeng
    Zhao, Fu
    Huang, Yong
    Li, Chunlin
    Luo, Youlong
    [J]. COMPUTER COMMUNICATIONS, 2022, 194 : 66 - 85
  • [6] Intermediate data placement and cache replacement strategy under Spark platform
    Li, Chunlin
    Zhang, Yong
    Luo, Youlong
    [J]. JOURNAL OF PARALLEL AND DISTRIBUTED COMPUTING, 2022, 163 : 114 - 135
  • [7] Efficient Incremental Data Analytics with Apache Spark
    Gholamian, Sina
    Golab, Wojciech
    Ward, Paul A. S.
    [J]. 2017 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2017, : 2859 - 2868
  • [8] A data-aware scheduling strategy for workflow execution in clouds
    Marozzo, Fabrizio
    Rodrigo Duro, Francisco
    Garcia Blas, Javier
    Carretero, Jesus
    Talia, Domenico
    Trunfio, Paolo
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2017, 29 (24):
  • [9] An efficient cache replacement strategy for the hybrid cache consistency approach
    Zeitunlian, Aline
    Haraty, Ramzi A.
    [J]. World Academy of Science, Engineering and Technology, 2010, 63 : 268 - 273
  • [10] A Data-aware Learned Index Scheme for Efficient Writes
    Liu, Li
    Li, Chunhua
    Zhang, Zhou
    Liu, Yuhan
    Zhou, Ke
    Zhang, Ji
    [J]. 51ST INTERNATIONAL CONFERENCE ON PARALLEL PROCESSING, ICPP 2022, 2022,