Dataset Popularity Prediction for Caching of CMS Big Data

Cited by: 10

Authors:

Meoni, Marco [1,2]

Perego, Raffaele [2]

Tonellotto, Nicola [2]

Affiliations:

[1] INFN, Pisa, Italy

[2] ISTI CNR, Pisa, Italy

Source:

JOURNAL OF GRID COMPUTING | 2018 / Vol. 16 / Issue 02

Keywords:

Machine learning; Big data; Dataset popularity; Classification; Caching strategies

DOI:

10.1007/s10723-018-9436-4

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

The Compact Muon Solenoid (CMS) experiment at the European Organization for Nuclear Research (CERN) deploys its data collections, simulation and analysis activities on a distributed computing infrastructure involving more than 70 sites worldwide. The historical usage data recorded by this large infrastructure is a rich source of information for system tuning and capacity planning. In this paper we investigate how to leverage machine learning on this huge amount of data in order to discover patterns and correlations useful to enhance the overall efficiency of the distributed infrastructure in terms of CPU utilization and task completion time. In particular, we propose a scalable pipeline of components built on top of the Spark engine for large-scale data processing, whose goal is collecting the dataset access logs from different sites, organizing them into weekly snapshots, and training, on these snapshots, predictive models able to forecast which datasets will become popular over time. The high accuracy achieved indicates the ability of the learned model to correctly separate popular datasets from unpopular ones. Dataset popularity predictions are then exploited within a novel data caching policy, called PPC (Popularity Prediction Caching). We evaluate the performance of PPC against popular caching policy baselines like LRU (Least Recently Used). The experiments conducted on large traces of real dataset accesses show that PPC outperforms LRU, reducing the number of cache misses by up to 20% at some sites.


Pages: 211-228

Page count: 18

50 records in total

[1] Dataset Popularity Prediction for Caching of CMS Big Data
Marco Meoni
Raffaele Perego
Nicola Tonellotto
[J]. Journal of Grid Computing, 2018, 16 : 211 - 228
[2] Predicting dataset popularity for the CMS experiment
Kuznetsov, V
Li, T.
Giommi, L.
Bonacorsi, D.
Wildish, T.
[J]. 17TH INTERNATIONAL WORKSHOP ON ADVANCED COMPUTING AND ANALYSIS TECHNIQUES IN PHYSICS RESEARCH (ACAT2016), 2016, 762
[3] Big Data Analytics for Popularity Prediction
Murthy, G. Vishnu
SwathiReddy, M.
Balakrishna, G.
[J]. INTERNATIONAL CONFERENCE ON COMPUTER VISION AND MACHINE LEARNING, 2019, 1228
[4] Clustered Popularity Prediction for Content Caching
Chen, Qi
Wang, Wei
Zhang, Zhaoyang
[J]. ICC 2019 - 2019 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC), 2019
[5] PPC: Popularity Prediction Caching in ICN
Zhang, Yuanzun
Tan, Xiaobin
Li, Weiping
[J]. IEEE COMMUNICATIONS LETTERS, 2018, 22 (01) : 5 - 8
[6] Cooperative Caching with Content Popularity Prediction for Mobile Edge Caching
Sun, Sanshan
Jiang, Wei
Feng, Gang
Qin, Shuang
Yuan, Ye
[J]. TEHNICKI VJESNIK-TECHNICAL GAZETTE, 2019, 26 (02) : 503 - 509
[7] A Probabilistic Analysis of Data Popularity in ATLAS Data Caching
Titov, M.
Zaruba, G.
Klimentov, A.
De, K.
[J]. INTERNATIONAL CONFERENCE ON COMPUTING IN HIGH ENERGY AND NUCLEAR PHYSICS 2012 (CHEP2012), PTS 1-6, 2012, 396
[8] Big Data Analytics for Program Popularity Prediction in Broadcast TV Industries
Zhu, Chengang
Cheng, Guang
Wang, Kun
[J]. IEEE ACCESS, 2017, 5 : 24593 - 24601
[9] Impact of Prediction Uncertainty of Popularity Distribution on Proactive Caching
Cong, Pengyu
Qi, Kaiqiang
Yang, Chenyang
[J]. 2019 IEEE/CIC INTERNATIONAL CONFERENCE ON COMMUNICATIONS IN CHINA (ICCC), 2019
[10] Popularity prediction-based caching in content delivery networks
Nesrine Ben Hassine
Pascale Minet
Dana Marinca
Dominique Barth
[J]. Annals of Telecommunications, 2019, 74 : 351 - 364
