A parallel and balanced SVM algorithm on spark for data-intensive computing

Cited by: 1
Authors
Li, Jianjiang [1]
Shi, Jinliang [1]
Liu, Zhiguo [2]
Feng, Can [1]
Affiliations
[1] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Beijing, Peoples R China
[2] Meituan, Beijing, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
SVM; data mining; machine learning; data skew; spark; SUPPORT VECTOR MACHINES; REGRESSION;
DOI
10.3233/IDA-226774
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104; 0812; 0835; 1405;
Abstract
Support Vector Machine (SVM) is a machine learning method with excellent classification performance that has been widely used in fields such as data mining, text classification, and face recognition. However, once the data volume grows beyond a certain scale, computation time becomes too long and efficiency drops. To address this issue, we propose a parallel balanced SVM algorithm based on Spark, named PB-SVM, which optimizes the traditional Cascade SVM algorithm. PB-SVM consists of three parts, namely Clustering Equal Division, Balancing Shuffle, and Iteration Termination, which together address the data skew of Cascade SVM and the large difference between local and global support vectors. We implement PB-SVM on an AliCloud Spark distributed cluster and test it with five public datasets. Our experimental results show that, in the binary classification test on the covtype dataset, PB-SVM improves efficiency by 38.9% and 75.4% and accuracy by 7.16% and 8.38% compared with MLlib-SVM and Cascade SVM on Spark, respectively. Moreover, in the multi-class classification test on covtype, PB-SVM improves efficiency and accuracy by 94.8% and 18.26%, respectively, compared with Cascade SVM on Spark.
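The abstract names data skew across partitions as one of the problems PB-SVM targets, but it does not describe how Clustering Equal Division or Balancing Shuffle work. The sketch below is therefore not the authors' method. It is a generic illustration of one way to split samples so that every partition has nearly the same size and a similar mix of groups. The function name `balanced_partitions` and the toy labels are assumptions made for this example.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Hashable, Sequence


def balanced_partitions(labels: Sequence[Hashable], num_partitions: int) -> list[list[int]]:
    """Deal sample indices into partitions, group by group, round-robin.

    `labels` holds one group id per sample (a class label or a cluster id).
    Hypothetical helper for illustration only; not taken from the paper.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")

    # Collect the indices that belong to each group, preserving input order.
    groups: dict[Hashable, list[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        groups[label].append(index)

    partitions: list[list[int]] = [[] for _ in range(num_partitions)]
    slot = 0  # carried across groups so leftovers do not pile up on partition 0
    for members in groups.values():
        for index in members:
            partitions[slot].append(index)
            slot = (slot + 1) % num_partitions
    return partitions


if __name__ == "__main__":
    # Skewed toy data: 10 samples of group "a" and 2 of group "b".
    toy_labels = ["a"] * 10 + ["b"] * 2
    for part in balanced_partitions(toy_labels, 4):
        print(len(part), [toy_labels[i] for i in part])
```

Because the slot counter runs continuously, partition sizes differ by at most one sample, and each group's samples are spread over the partitions with the same bound. In a cascade-style training scheme, each such partition would then be handed to a separate worker.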
Pages: 1065-1086
Page count: 22
Related papers
50 in total
  • [31] A Resistive TCAM Accelerator for Data-Intensive Computing
    Guo, Qing
    Guo, Xiaochen
    Bai, Yuxin
    Ipek, Engin
    PROCEEDINGS OF THE 2011 44TH ANNUAL IEEE/ACM INTERNATIONAL SYMPOSIUM ON MICROARCHITECTURE (MICRO 44), 2011, : 339 - 350
  • [32] PARROT: AN APPLICATION ENVIRONMENT FOR DATA-INTENSIVE COMPUTING
    Thain, Douglas
    Livny, Miron
    SCALABLE COMPUTING-PRACTICE AND EXPERIENCE, 2005, 6 (03): : 9 - 18
  • [33] Fault Tolerant Parallel Data-Intensive Algorithms
    Kutlu, Mucahid
    Agrawal, Gagan
    Kurt, Oguz
    2012 19TH INTERNATIONAL CONFERENCE ON HIGH PERFORMANCE COMPUTING (HIPC), 2012,
  • [34] Parallel Optimization for Data-Intensive Service Composition
    Deng, Shuiguang
    Huang, Longtao
    Wu, Bin
    Xiong, Lirong
    JOURNAL OF INTERNET TECHNOLOGY, 2013, 14 (05): : 817 - 824
  • [35] Improvement Of Data Throughput In Data-Intensive Cloud Computing Applications
    Ibrahim, Ibrahim Adel
    Bassiouni, Mostafa
    2019 IEEE FIFTH INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING SERVICE AND APPLICATIONS (IEEE BIGDATASERVICE 2019), 2019, : 49 - 54
  • [36] In-Memory Data Rearrangement for Irregular, Data-Intensive Computing
    Lloyd, Scott
    Gokhale, Maya
    COMPUTER, 2015, 48 (08) : 18 - 25
  • [37] Data Allocation with Neural Similarity Estimation for Data-Intensive Computing
    Vamosi, Ralf
    Schikuta, Erich
    COMPUTATIONAL SCIENCE - ICCS 2022, PT III, 2022, 13352 : 534 - 546
  • [38] Accelerating Data-Intensive Applications: A Cloud Computing Approach to Parallel Image Pattern Recognition Tasks
    Han, Liangxiu
    Saengngam, Tantana
    van Hemert, Jano
    PROCEEDINGS OF THE FOURTH INTERNATIONAL CONFERENCE ON ADVANCED ENGINEERING COMPUTING AND APPLICATIONS IN SCIENCES (ADVCOMP 2010), 2010, : 148 - 153
  • [39] An Improved Bayesian Inference Method for Data-Intensive Computing
    Ma, Feng
    Liu, Weiyi
    COMPUTATIONAL INTELLIGENCE AND INTELLIGENT SYSTEMS, 2012, 316 : 134 - 144
  • [40] Innovative methods and algorithms for advanced data-intensive computing
    Cuzzocrea, A. (cuzzocrea@si.deis.unical.it), 1600, Elsevier B.V. (37):