Evaluation of Machine Learning Frameworks on Bank Marketing and Higgs Datasets

Cited by: 4
Authors
Shashidhara, Bhuvan M. [1]
Jain, Siddharth [1]
Rao, Vinay D. [1]
Patil, Nagamma [1]
Raghavendra, G. S. [1]
Affiliations
[1] Natl Inst Technol Karnataka, Dept Informat Technol, Surathkal, India
Keywords
Machine Learning Algorithms; Big Data; Parallel Execution; Distributed Computing; WEKA; Scikit-Learn; Apache Spark
DOI
10.1109/ICACCE.2015.31
Chinese Library Classification number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
Big data is an emerging field in which datasets of various sizes are being analyzed for potential applications. In parallel, many frameworks are being introduced through which these datasets can be fed into machine learning algorithms. Though some experiments have compared different machine learning algorithms on different data, these comparisons have not been carried out across different platforms. Our research aims to compare two selected machine learning algorithms on datasets of different sizes across different platforms such as Weka, Scikit-Learn and Apache Spark. They are evaluated on training time, accuracy and root mean squared error. This comparison helps decide which platform is best suited for applying the selected, computationally expensive machine learning algorithms to data of a particular size. Experiments suggested that Scikit-Learn would be optimal for data that fits into memory. When working with huge data, Apache Spark would be optimal, as it performs parallel computations by distributing the data over a cluster. Hence this study concludes that the Spark platform, which has growing support for parallel implementations of machine learning algorithms, could be optimal for analyzing big data.
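The three evaluation criteria named in the abstract (training time, accuracy, root mean squared error) can be illustrated with a minimal, framework-independent sketch. This is not the authors' code: the `evaluate` helper, the majority-class stand-in model and the toy data are all hypothetical, chosen only to show how the three measurements could be taken around any platform's fit and predict calls.

```python
import math
import time
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between numeric labels and predictions."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))


def evaluate(fit, predict, X_train, y_train, X_test, y_test):
    """Time the training step only, then score predictions on held-out data."""
    start = time.perf_counter()
    model = fit(X_train, y_train)
    train_seconds = time.perf_counter() - start
    y_pred = predict(model, X_test)
    return {
        "train_seconds": train_seconds,
        "accuracy": accuracy(y_test, y_pred),
        "rmse": rmse(y_test, y_pred),
    }


# Hypothetical stand-in model: always predicts the most common training label.
# A real comparison would plug in each platform's own training and prediction calls.
def fit_majority(X, y):
    return Counter(y).most_common(1)[0][0]


def predict_majority(model, X):
    return [model for _ in X]


# Toy binary-labelled data, invented for illustration only.
X_train = [[0.1], [0.4], [0.8], [0.9]]
y_train = [0, 0, 0, 1]
X_test = [[0.2], [0.7], [0.95], [0.3]]
y_test = [0, 0, 1, 0]

result = evaluate(fit_majority, predict_majority, X_train, y_train, X_test, y_test)
print(result)
```

Passing the fit and predict steps in as functions keeps the measurement code identical across platforms, so differences in the reported numbers would come from the platform rather than from the harness.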
Pages: 551-555
Page count: 5