Big data execution time based on Spark Machine Learning Libraries

Cited by: 1
Authors
Garate-Escamilla, Anna Karen [1]
Hajjam El Hassani, Amir [1]
Andres, Emmanuel [1, 2]
Affiliations
[1] Univ Bourgogne Franche Comte, UTBM, Nanomed Lab, 12 Rue Thierry Mieg,Rue Edouard Branly, F-90000 Belfort, France
[2] CHRU Strasbourg, Serv Med Interne Diabet & Malad Metab Clin Med B, 5 Ave Moliere, F-67200 Strasbourg, France
Keywords
Machine Learning; Apache Spark; Performance prediction model; Execution time prediction
DOI
10.1145/3358505.3358519
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
This paper examines the execution time of supervised and unsupervised models in the Apache Spark framework on massive datasets. Big data analytics has become relevant in industry because of the need to convert information into knowledge. One of the challenges of big data is devising strategies that reduce the execution cost of running machine learning models to make predictions. Apache Spark is a powerful in-memory platform that offers an extensive machine learning library for regression, classification, clustering, and rule extraction. From a computational cost perspective, this investigation performs several experiments on real datasets. The main contribution of the paper is a comparison of the execution times of different machine learning models, such as random forests, decision trees, logistic regression, linear support vector machines, and kNN. The work aims to combine the areas of big data and machine learning by comparing results across different configurations and with the optimization methods cache and persist. The evaluation experiments show that logistic regression had the shortest execution time among the Spark MLlib models.
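The kind of comparison the abstract describes can be illustrated with a minimal sketch. This is not the authors' experimental code: the function names `time_call` and `compare_cache_modes`, the CSV input, the column names, and the choice of `maxIter=10` and `StorageLevel.MEMORY_ONLY` are all assumptions made for illustration. It assumes a numeric label column and an existing `SparkSession`.

```python
import time


def time_call(fn, repeats=3):
    """Call fn() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        durations.append(time.perf_counter() - start)
    return durations


def compare_cache_modes(spark, csv_path, feature_cols, label_col="label"):
    """Time LogisticRegression.fit on uncached, cached, and persisted input.

    Hypothetical helper: `csv_path`, `feature_cols`, and `label_col` are
    placeholders, not values from the paper. PySpark is imported here so
    that `time_call` above stays usable without it.
    """
    from pyspark import StorageLevel
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    raw = spark.read.csv(csv_path, header=True, inferSchema=True)
    assembler = VectorAssembler(inputCols=feature_cols, outputCol="features")
    data = assembler.transform(raw).select("features", label_col)
    lr = LogisticRegression(featuresCol="features", labelCol=label_col, maxIter=10)

    # Baseline: every fit re-reads and re-assembles the input.
    results = {"uncached": time_call(lambda: lr.fit(data))}

    # cache(): default storage level; count() is an action that fills the cache.
    data.cache()
    data.count()
    results["cache"] = time_call(lambda: lr.fit(data))
    data.unpersist(blocking=True)

    # persist(): same mechanism with an explicitly chosen storage level.
    data.persist(StorageLevel.MEMORY_ONLY)
    data.count()
    results["persist"] = time_call(lambda: lr.fit(data))
    data.unpersist(blocking=True)
    return results
```

Calling `compare_cache_modes(spark, "data.csv", ["f1", "f2"])` (hypothetical path and columns) would return a dictionary of per-run durations for each mode. Swapping `LogisticRegression` for another estimator is one way to extend the comparison across models.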
Pages: 78-83
Page count: 6