Machine Learning Pipelines with Modern Big Data Tools for High Energy Physics

Cited by: 2
Authors
M. Migliorini
R. Castellotti
L. Canali
M. Zanetti
Affiliations
[1] European Organization for Nuclear Research (CERN)
[2] University of Padova
Keywords
Big Data; Machine Learning; HEP; Distributed computing; Parallel computing; GPU
DOI
10.1007/s41781-020-00040-0
Abstract
Applying complex machine learning (ML) techniques at scale to HEP use cases poses several technological challenges, most notably the implementation of dedicated end-to-end data pipelines. This work presents a solution to these challenges that trains neural network classifiers using tools from the Big Data and data science ecosystems, integrated with the tools, software, and platforms common in the HEP environment. In particular, Apache Spark is used for data preparation and feature engineering, with the corresponding Python code run interactively in Jupyter notebooks. The key integrations and libraries that enable Spark to ingest data stored in the ROOT format and accessed via the XRootD protocol are described and discussed. The neural network models, defined using the Keras API, are trained in a distributed fashion on Spark clusters using BigDL with Analytics Zoo, and also using TensorFlow, notably for distributed training on CPU and GPU resources. The implementation and results of the distributed training are described in detail.
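The abstract does not spell out how the distributed training is organized. As a rough illustration only, the sketch below simulates one common scheme, synchronous data-parallel training, in plain Python: each simulated worker computes a gradient on its own data shard, the gradients are averaged, and a single shared set of weights is updated. This is not the authors' BigDL or TensorFlow code. The logistic-regression model, the toy data, and every function name (`make_toy_data`, `shard_gradient`, `average_gradients`, `train_data_parallel`, `accuracy`) are hypothetical stand-ins chosen to keep the example self-contained.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_toy_data(n=400, seed=0):
    """Hypothetical two-feature binary classification data (plus a bias term)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
        label = 1.0 if x1 + x2 > 0 else 0.0
        data.append(([x1, x2, 1.0], label))
    return data


def shard_gradient(weights, shard):
    """Mean logistic-loss gradient computed by one worker on its own shard."""
    grad = [0.0] * len(weights)
    for features, label in shard:
        pred = sigmoid(sum(w * x for w, x in zip(weights, features)))
        for i, x in enumerate(features):
            grad[i] += (pred - label) * x
    return [g / len(shard) for g in grad]


def average_gradients(grads):
    """Combine per-worker gradients; stands in for an all-reduce step."""
    n = len(grads)
    return [sum(g[i] for g in grads) / n for i in range(len(grads[0]))]


def train_data_parallel(data, n_workers=4, epochs=200, lr=0.5):
    # Assumption: equal-sized shards, so the averaged gradient
    # matches the full-batch gradient.
    shards = [data[i::n_workers] for i in range(n_workers)]
    weights = [0.0] * len(data[0][0])
    for _ in range(epochs):
        # In a real cluster these calls would run concurrently on separate nodes.
        grads = [shard_gradient(weights, shard) for shard in shards]
        avg = average_gradients(grads)
        weights = [w - lr * g for w, g in zip(weights, avg)]
    return weights


def accuracy(weights, data):
    correct = 0
    for features, label in data:
        pred = sigmoid(sum(w * x for w, x in zip(weights, features)))
        correct += int((pred > 0.5) == (label == 1.0))
    return correct / len(data)


if __name__ == "__main__":
    toy = make_toy_data()
    trained = train_data_parallel(toy)
    print("training accuracy:", accuracy(trained, toy))
```

With equal-sized shards, averaging the per-worker gradients reproduces the full-batch gradient, so adding workers changes where the computation runs but not the update itself. Real frameworks add communication, fault tolerance, and mini-batching on top of this idea.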