Evaluation of Machine Learning Frameworks on Bank Marketing and Higgs Datasets

Cited by: 4
Authors
Shashidhara, Bhuvan M. [1]
Jain, Siddharth [1]
Rao, Vinay D. [1]
Patil, Nagamma [1]
Raghavendra, G. S. [1]
Affiliations
[1] Natl Inst Technol Karnataka, Dept Informat Technol, Surathkal, India
Keywords
Machine Learning Algorithms; Big Data; Parallel Execution; Distributed Computing; WEKA; Scikit-Learn; Apache Spark
DOI
10.1109/ICACCE.2015.31
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Big data is an emerging field in which datasets of various sizes are analyzed for potential applications. In parallel, many frameworks are being introduced through which these datasets can be fed into machine learning algorithms. Although some experiments have compared different machine learning algorithms on different data, these comparisons have not been carried out across different platforms. Our research aims to compare two selected machine learning algorithms on datasets of different sizes deployed on different platforms such as Weka, Scikit-Learn and Apache Spark. They are evaluated on training time, accuracy and root mean squared error. This comparison helps decide which platform is best suited for applying the selected computationally expensive machine learning algorithms to data of a particular size. The experiments suggest that Scikit-Learn is optimal for data that fits into memory. For very large data, Apache Spark is optimal, as it performs parallel computations by distributing the data over a cluster. Hence this study concludes that the Spark platform, which has growing support for parallel implementations of machine learning algorithms, could be optimal for analyzing big data.
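The three evaluation criteria named in the abstract (training time, accuracy and root mean squared error) can be illustrated with a minimal, framework-independent sketch. This is not the authors' code: the `evaluate` helper and the majority-class stand-in model are hypothetical names introduced only for illustration, and the abstract does not say which two algorithms were compared or how the metrics were computed on each platform.

```python
import math
import time
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between numeric labels and predictions."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def evaluate(train_fn, predict_fn, X_train, y_train, X_test, y_test):
    """Time the training step only, then score predictions on held-out data.

    train_fn and predict_fn are hypothetical callables standing in for
    whichever platform (Weka, Scikit-Learn, Spark) is being measured.
    """
    start = time.perf_counter()
    model = train_fn(X_train, y_train)
    train_seconds = time.perf_counter() - start

    y_pred = [predict_fn(model, x) for x in X_test]
    return {
        "train_seconds": train_seconds,
        "accuracy": accuracy(y_test, y_pred),
        "rmse": rmse(y_test, y_pred),
    }


# Stand-in model (an assumption, not from the paper): always predict the
# most frequent training label.
def majority_train(X, y):
    return Counter(y).most_common(1)[0][0]


def majority_predict(model, x):
    return model


result = evaluate(
    majority_train, majority_predict,
    X_train=[[0], [1], [2]], y_train=[1, 1, 0],
    X_test=[[3], [4], [5], [6]], y_test=[1, 0, 1, 1],
)
```

Running the same `evaluate` call with each platform's training and prediction functions would give directly comparable numbers, provided the train and test splits are held fixed across platforms.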
Pages: 551 - 555
Number of pages: 5