A parallel and balanced SVM algorithm on spark for data-intensive computing

Cited by: 1
Authors
Li, Jianjiang [1 ]
Shi, Jinliang [1 ]
Liu, Zhiguo [2 ]
Feng, Can [1 ]
Affiliations
[1] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Beijing, Peoples R China
[2] Meituan, Beijing, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
SVM; data mining; machine learning; data skew; spark; SUPPORT VECTOR MACHINES; REGRESSION;
DOI
10.3233/IDA-226774
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Support Vector Machine (SVM) is a machine learning method with excellent classification performance that has been widely used in fields such as data mining, text classification, and face recognition. However, once the data volume grows beyond a certain scale, computation time becomes too long and efficiency drops. To address this issue, we propose a parallel balanced SVM algorithm based on Spark, named PB-SVM, which is optimized on the basis of the traditional Cascade SVM algorithm. PB-SVM consists of three parts, namely Clustering Equal Division, Balancing Shuffle, and Iteration Termination, which together address two problems of Cascade SVM: data skew and the large difference between local and global support vectors. We implement PB-SVM on an AliCloud Spark distributed cluster and evaluate it on five public datasets. Our experimental results show that in the binary classification test on the covtype dataset, compared with MLlib-SVM and Cascade SVM on Spark, PB-SVM improves efficiency by 38.9% and 75.4%, and accuracy by 7.16% and 8.38%, respectively. Moreover, in the multi-class classification test on the covtype dataset, compared with Cascade SVM on Spark, PB-SVM improves efficiency and accuracy by 94.8% and 18.26%, respectively.
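The data skew problem mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's Clustering Equal Division or Balancing Shuffle step, whose details are not given here; it is a generic stratified round-robin assignment, and the function name `balanced_partitions` and the toy label counts are assumptions made for illustration. It shows one simple way to give every partition the same number of samples and a similar class mix, so that no single sub-SVM in a cascade receives a disproportionate share of the data.

```python
from collections import Counter, defaultdict


def balanced_partitions(labels, num_partitions):
    """Hypothetical helper: deal sample indices into partitions class by class.

    Dealing the samples of each class in round-robin order keeps partition
    sizes within one sample of each other and spreads every class evenly.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    partitions = [[] for _ in range(num_partitions)]
    slot = 0  # running counter so the deal continues across classes
    for label in sorted(by_class):
        for idx in by_class[label]:
            partitions[slot % num_partitions].append(idx)
            slot += 1
    return partitions


# Toy, imbalanced label vector (assumed data): 90 samples of class 0, 30 of class 1.
labels = [0] * 90 + [1] * 30
parts = balanced_partitions(labels, 4)

sizes = [len(p) for p in parts]
class_counts = [Counter(labels[i] for i in p) for p in parts]
print(sizes)
print(class_counts)
```

In a Spark setting, the partition index produced this way could serve as the key for a custom partitioner; that mapping is likewise an assumption and is not taken from the paper.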
Pages: 1065 - 1086
Number of pages: 22