A parallel and balanced SVM algorithm on spark for data-intensive computing

Cited by: 1

Authors:

Li, Jianjiang [1]

Shi, Jinliang [1]

Liu, Zhiguo [2]

Feng, Can [1]

Affiliations:

[1] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Beijing, Peoples R China

[2] Meituan, Beijing, Peoples R China

Source:

INTELLIGENT DATA ANALYSIS | 2023 / Vol. 27 / No. 04

Funding:

National Key Research and Development Program of China;

Keywords:

SVM; data mining; machine learning; data skew; spark; SUPPORT VECTOR MACHINES; REGRESSION;

DOI:

10.3233/IDA-226774

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Support Vector Machine (SVM) is a machine learning method with excellent classification performance, and it has been widely used in fields such as data mining, text classification, and face recognition. However, once the data volume reaches a certain scale, the computation time becomes too long and efficiency drops. To address this issue, we propose a parallel balanced SVM algorithm based on Spark, named PB-SVM, which is optimized on the basis of the traditional Cascade SVM algorithm. PB-SVM contains three parts, i.e., Clustering Equal Division, Balancing Shuffle and Iteration Termination, which together address two problems of Cascade SVM: data skew, and the large difference between local support vectors and global support vectors. We implement PB-SVM on an AliCloud Spark distributed cluster and evaluate it with five public datasets. Our experimental results show that in the binary classification test on the covtype dataset, compared with MLlib-SVM and Cascade SVM on Spark, PB-SVM improves efficiency by 38.9% and 75.4%, and improves accuracy by 7.16% and 8.38%, respectively. Moreover, in the multi-class classification test on the covtype dataset, compared with Cascade SVM on Spark, PB-SVM improves efficiency and accuracy by 94.8% and 18.26%, respectively.
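
The abstract names data skew across partitions as one of the problems PB-SVM targets but does not spell out how Clustering Equal Division works. The following is only a minimal sketch of the general idea of skew-free partitioning, not the authors' method: it deals labeled samples into k partitions so that partition sizes differ by at most one and each partition keeps roughly the overall class ratio. The function name `balanced_partitions` and the `(features, label)` sample layout are assumptions made for illustration, and the sketch uses plain Python rather than Spark.

```python
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple

# Assumed sample layout for this sketch: (feature vector, class label).
Sample = Tuple[Sequence[float], Hashable]


def balanced_partitions(samples: List[Sample], k: int) -> List[List[Sample]]:
    """Deal samples into k partitions, label by label, in round-robin order.

    Hypothetical helper, not taken from the paper. A single counter runs
    across all labels, so partition sizes differ by at most one, and each
    label is spread as evenly as possible over the partitions.
    """
    if k <= 0:
        raise ValueError("k must be positive")

    # Group samples by label, preserving first-seen label order.
    by_label = defaultdict(list)
    for sample in samples:
        by_label[sample[1]].append(sample)

    partitions: List[List[Sample]] = [[] for _ in range(k)]
    next_slot = 0
    for label_samples in by_label.values():
        for sample in label_samples:
            partitions[next_slot % k].append(sample)
            next_slot += 1
    return partitions


if __name__ == "__main__":
    # Toy data with a 9:1 class imbalance.
    data = [([float(i)], 0) for i in range(90)] + [([float(i)], 1) for i in range(10)]
    for idx, part in enumerate(balanced_partitions(data, 5)):
        positives = sum(1 for _, label in part if label == 1)
        print(idx, len(part), positives)
```

In a cascade-style scheme, each such partition would be handed to a separate local SVM trainer; equal sizes keep the slowest worker from dominating the run time, and similar class ratios make it less likely that a partition sees only one class.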

Pages: 1065 - 1086

Page count: 22
