A parallel and balanced SVM algorithm on spark for data-intensive computing

Cited by: 1
Authors
Li, Jianjiang [1]
Shi, Jinliang [1]
Liu, Zhiguo [2]
Feng, Can [1]
Affiliations
[1] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Beijing, Peoples R China
[2] Meituan, Beijing, Peoples R China
Funding
National Key Research and Development Program;
Keywords
SVM; data mining; machine learning; data skew; spark; SUPPORT VECTOR MACHINES; REGRESSION;
DOI
10.3233/IDA-226774
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Support Vector Machine (SVM) is a machine learning method with excellent classification performance that has been widely used in fields such as data mining, text classification, and face recognition. However, once the data volume reaches a certain scale, computation time becomes too long and efficiency drops. To address this issue, we propose a parallel balanced SVM algorithm based on Spark, named PB-SVM, which is optimized on the basis of the traditional Cascade SVM algorithm. PB-SVM consists of three parts, namely Clustering Equal Division, Balancing Shuffle, and Iteration Termination, which together solve two problems of Cascade SVM: data skew, and the large difference between local support vectors and global support vectors. We implement PB-SVM on an AliCloud Spark distributed cluster and evaluate it on five public datasets. Our experimental results show that in the binary classification test on the covtype dataset, compared with MLlib-SVM and Cascade SVM on Spark, PB-SVM improves efficiency by 38.9% and 75.4% and accuracy by 7.16% and 8.38%, respectively. Moreover, in the multi-class classification test on the covtype dataset, compared with Cascade SVM on Spark, PB-SVM improves efficiency and accuracy by 94.8% and 18.26%, respectively.
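The abstract names a Clustering Equal Division step but does not describe it. Purely as an illustration of how a cluster-aware, equal-size split can counter data skew, here is a minimal Python sketch. It is not the authors' procedure: the function `balanced_partitions`, the `cluster_of` callback, and the round-robin dealing strategy are all assumptions introduced for this example.

```python
from collections import defaultdict
from itertools import chain


def balanced_partitions(points, cluster_of, num_partitions):
    """Hypothetical helper (not from the paper): split `points` into
    `num_partitions` lists of near-equal size with a similar cluster mix."""
    # Group the points by their (assumed, precomputed) cluster label.
    by_cluster = defaultdict(list)
    for p in points:
        by_cluster[cluster_of(p)].append(p)

    # Lay the clusters end to end, then deal the points out round-robin.
    # Consecutive members of one cluster land in different partitions, so
    # per-cluster counts and total sizes differ by at most one.
    ordered = chain.from_iterable(by_cluster[c] for c in sorted(by_cluster))
    partitions = [[] for _ in range(num_partitions)]
    for i, p in enumerate(ordered):
        partitions[i % num_partitions].append(p)
    return partitions


# Skewed toy input: 7 points in cluster 0, 3 points in cluster 1.
points = list(range(10))
cluster_of = lambda p: 0 if p < 7 else 1
parts = balanced_partitions(points, cluster_of, 3)
```

In a Cascade-style setup each such partition would train a local SVM; giving every partition a similar size and cluster mix is one plausible way to keep workers evenly loaded and local support vectors closer to the global ones.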
Pages: 1065-1086
Page count: 22
Related papers
50 in total
  • [21] Parallel data-intensive algorithms and applications
    Talia, D
    Srimani, PK
    PARALLEL COMPUTING, 2002, 28 (05) : 669 - 671
  • [22] Data-Intensive Scalable Computing for Scientific Applications
    Bryant, Randal E.
    COMPUTING IN SCIENCE & ENGINEERING, 2011, 13 (06) : 25 - 33
  • [23] Research on the architecture of data-intensive computing platform
    Hou, Ke
    Zhang, Jing
    Fang, Xing
    Journal of Software Engineering, 2015, 9 (03) : 686 - 701
  • [24] The Benefits of Service Choreography for Data-intensive Computing
    Barker, Adam
    Besana, Paolo
    Robertson, David
    Weissman, Jon B.
    CLADE09: 7TH INTERNATIONAL WORKSHOP ON CHALLENGES OF LARGE APPLICATIONS IN DISTRIBUTED ENVIRONMENTS, 2009, : 1 - 10
  • [25] A Framework for Data-Intensive Computing with Cloud Bursting
    Bicer, Tekin
    Chiu, David
    Agrawal, Gagan
    2011 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING (CLUSTER), 2011, : 169 - 177
  • [26] Challenges and Opportunities for Data-Intensive Computing in the Cloud
    Jung, Eun-Sung
    Kettimuthu, Rajkumar
    COMPUTER, 2014, 47 (12) : 82 - 85
  • [27] Automated Debugging in Data-Intensive Scalable Computing
    Gulzar, Muhammad Ali
    Interlandi, Matteo
    Han, Xueyuan
    Li, Mingda
    Condie, Tyson
    Kim, Miryung
    PROCEEDINGS OF THE 2017 SYMPOSIUM ON CLOUD COMPUTING (SOCC '17), 2017, : 520 - 534
  • [28] Coordinating Green Clouds as Data-Intensive Computing
    Biran, Yahav
    Collins, George
    Liberatore, Joseph
    PROCEEDINGS 2016 EIGHTH ANNUAL IEEE GREEN TECHNOLOGIES CONFERENCE (GREENTECH 2016), 2016, : 130 - 135
  • [29] Real-Time Data-Intensive Computing
    Parkinson, Dilworth Y.
    Beattie, Keith
    Chen, Xian
    Correa, Joaquin
    Dart, Eli
    Daurer, Benedikt J.
    Deslippe, Jack R.
    Hexemer, Alexander
    Krishnan, Harinarayan
    MacDowell, Alastair A.
    Maia, Filipe R. N. C.
    Marchesini, Stefano
    Padmore, Howard A.
    Patton, Simon J.
    Perciano, Talita
    Sethian, James A.
    Shapiro, David
    Stromsness, Rune
    Tamura, Nobumichi
    Tierney, Brian L.
    Tull, Craig E.
    Ushizima, Daniela
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON SYNCHROTRON RADIATION INSTRUMENTATION (SRI2015), 2016, 1741
  • [30] Robinia-BLAST: An Extensible Parallel BLAST based on Data-intensive Distributed Computing
    Gu, Yang
    Huang, Zhenchun
    2014 IEEE 12TH INTERNATIONAL CONFERENCE ON DEPENDABLE, AUTONOMIC AND SECURE COMPUTING (DASC)/2014 IEEE 12TH INTERNATIONAL CONFERENCE ON EMBEDDED COMPUTING (EMBEDDEDCOM)/2014 IEEE 12TH INTERNATIONAL CONF ON PERVASIVE INTELLIGENCE AND COMPUTING (PICOM), 2014, : 1 - 6