Dataset Popularity Prediction for Caching of CMS Big Data

Cited by: 10
Authors
Meoni, Marco [1, 2]
Perego, Raffaele [2 ]
Tonellotto, Nicola [2 ]
Affiliations
[1] INFN, Pisa, Italy
[2] ISTI CNR, Pisa, Italy
Keywords
Machine learning; Big data; Dataset popularity; Classification; Caching strategies
DOI
10.1007/s10723-018-9436-4
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The Compact Muon Solenoid (CMS) experiment at the European Organization for Nuclear Research (CERN) deploys its data collection, simulation and analysis activities on a distributed computing infrastructure involving more than 70 sites worldwide. The historical usage data recorded by this large infrastructure is a rich source of information for system tuning and capacity planning. In this paper we investigate how to leverage machine learning on this huge amount of data to discover patterns and correlations that can improve the overall efficiency of the distributed infrastructure in terms of CPU utilization and task completion time. In particular, we propose a scalable pipeline of components built on top of the Spark engine for large-scale data processing. Its goal is to collect the dataset access logs from different sites, organize them into weekly snapshots, and train, on these snapshots, predictive models able to forecast which datasets will become popular over time. The high accuracy achieved indicates the ability of the learned model to correctly separate popular datasets from unpopular ones. Dataset popularity predictions are then exploited within a novel data caching policy, called PPC (Popularity Prediction Caching). We evaluate the performance of PPC against popular caching policy baselines such as LRU (Least Recently Used). The experiments conducted on large traces of real dataset accesses show that PPC outperforms LRU, reducing the number of cache misses by up to 20% at some sites.
Pages: 211-228
Number of pages: 18
Related papers
50 in total
  • [1] Dataset Popularity Prediction for Caching of CMS Big Data
    Marco Meoni
    Raffaele Perego
    Nicola Tonellotto
    [J]. Journal of Grid Computing, 2018, 16 : 211 - 228
  • [2] Predicting dataset popularity for the CMS experiment
    Kuznetsov, V.
    Li, T.
    Giommi, L.
    Bonacorsi, D.
    Wildish, T.
    [J]. 17TH INTERNATIONAL WORKSHOP ON ADVANCED COMPUTING AND ANALYSIS TECHNIQUES IN PHYSICS RESEARCH (ACAT2016), 2016, 762
  • [3] Big Data Analytics for Popularity Prediction
    Murthy, G. Vishnu
    SwathiReddy, M.
    Balakrishna, G.
    [J]. INTERNATIONAL CONFERENCE ON COMPUTER VISION AND MACHINE LEARNING, 2019, 1228
  • [4] Clustered Popularity Prediction for Content Caching
    Chen, Qi
    Wang, Wei
    Zhang, Zhaoyang
    [J]. ICC 2019 - 2019 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC), 2019
  • [5] PPC: Popularity Prediction Caching in ICN
    Zhang, Yuanzun
    Tan, Xiaobin
    Li, Weiping
    [J]. IEEE COMMUNICATIONS LETTERS, 2018, 22 (01) : 5 - 8
  • [6] Cooperative Caching with Content Popularity Prediction for Mobile Edge Caching
    Sun, Sanshan
    Jiang, Wei
    Feng, Gang
    Qin, Shuang
    Yuan, Ye
    [J]. TEHNICKI VJESNIK-TECHNICAL GAZETTE, 2019, 26 (02) : 503 - 509
  • [7] A Probabilistic Analysis of Data Popularity in ATLAS Data Caching
    Titov, M.
    Zaruba, G.
    Klimentov, A.
    De, K.
    [J]. INTERNATIONAL CONFERENCE ON COMPUTING IN HIGH ENERGY AND NUCLEAR PHYSICS 2012 (CHEP2012), PTS 1-6, 2012, 396
  • [8] Big Data Analytics for Program Popularity Prediction in Broadcast TV Industries
    Zhu, Chengang
    Cheng, Guang
    Wang, Kun
    [J]. IEEE ACCESS, 2017, 5 : 24593 - 24601
  • [9] Impact of Prediction Uncertainty of Popularity Distribution on Proactive Caching
    Cong, Pengyu
    Qi, Kaiqiang
    Yang, Chenyang
    [J]. 2019 IEEE/CIC INTERNATIONAL CONFERENCE ON COMMUNICATIONS IN CHINA (ICCC), 2019
  • [10] Popularity prediction–based caching in content delivery networks
    Nesrine Ben Hassine
    Pascale Minet
    Dana Marinca
    Dominique Barth
    [J]. Annals of Telecommunications, 2019, 74 : 351 - 364