Big data execution time based on Spark Machine Learning Libraries

Cited by: 1

Authors:

Garate-Escamilla, Anna Karen [1]

Hajjam El Hassani, Amir [1]

Andres, Emmanuel [1, 2]

Affiliations:

[1] Univ Bourgogne Franche Comte, UTBM, Nanomed Lab, 12 Rue Thierry Mieg, Rue Edouard Branly, F-90000 Belfort, France

[2] CHRU Strasbourg, Serv Med Interne Diabet & Malad Metab Clin Med B, 5 Ave Moliere, F-67200 Strasbourg, France

Source:

PROCEEDINGS OF 2019 3RD INTERNATIONAL CONFERENCE ON CLOUD AND BIG DATA COMPUTING (ICCBDC 2019) | 2019

Keywords:

Machine Learning; Apache Spark; Performance prediction model; Execution time prediction;

DOI:

10.1145/3358505.3358519

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Subject classification code:

0812;

Abstract:

The paper explores the time consumption of supervised and unsupervised models in the Apache Spark framework on massive datasets. Big data analytics has become relevant in industry because of the need to convert information into knowledge. Among the challenges of big data is the creation of strategies to reduce the execution cost of running machine learning models to make predictions. Apache Spark is a powerful in-memory platform that offers an extensive machine learning library for regression, classification, clustering, and rule extraction. From a computational cost perspective, this investigation performs different experiments using real datasets. The main contribution of the paper is a comparison of the execution time of different machine learning models, such as random forests, decision trees, logistic regression, linear support vector machines, and kNN. The present work aims to combine the areas of big data and machine learning, comparing the results under different configurations and with the optimization methods cache and persist. The evaluation experiments show that logistic regression had the shortest execution time of the Spark MLlib models.

Pages: 78-83

Number of pages: 6

50 records in total

[21] Research on Parallel Support Vector Machine Based on Spark Big Data Platform
Huimin, Yao
SCIENTIFIC PROGRAMMING, 2021, 2021
[22] Efficient Big Data Analysis on a Single Machine using Apache Spark and Self-Organizing Map Libraries
Andresic, David
Saloun, Petr
Anagnostopoulos, Ioannis
2017 12TH INTERNATIONAL WORKSHOP ON SEMANTIC AND SOCIAL MEDIA ADAPTATION AND PERSONALIZATION (SMAP 2017), 2017, : 1 - 5
[23] Correlation Analysis of Network Big Data and Film Time-Series Data Based on Machine Learning Algorithm
Li, Na
Xia, Langbo
MATHEMATICAL PROBLEMS IN ENGINEERING, 2022, 2022
[24] Granular computing based machine learning in the era of big data
Hu, Qinghua
Mi, Jusheng
Chen, Degang
Information Sciences, 2022, 591 : 422 - 423
[26] Big Data Analysis of TV Dramas Based on Machine Learning
Tan, Jiaqi
Mao, Feiqiao
Yang, Lianghai
Wang, Jiahui
SMART COMPUTING AND COMMUNICATION, SMARTCOM 2017, 2018, 10699 : 90 - 95
[27] Implementation of a Self-Adaptive Real Time Recommendation System using Spark Machine Learning Libraries
Sunny, Bobin K.
Janardhanan, P. S.
Francis, Anu Bonia
Murali, Reena
2017 IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, INFORMATICS, COMMUNICATION AND ENERGY SYSTEMS (SPICES), 2017,
[28] Scalable Manifold Learning for Big Data with Apache Spark
Schoeneman, Frank
Zola, Jaroslaw
2018 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2018, : 272 - 281
[29] Machine learning for big data analytics
Oja, E. (erkki.oja@aalto.fi), 1600, Springer Verlag (384):
[30] Big data and machine learning in health
Carvalho, D.
Cruz, R.
EUROPEAN JOURNAL OF PUBLIC HEALTH, 2020, 30 : 10 - 11
