A parallel and balanced SVM algorithm on spark for data-intensive computing

Cited by: 1

Authors:

Li, Jianjiang [1]

Shi, Jinliang [1]

Liu, Zhiguo [2]

Feng, Can [1]

Affiliations:

[1] Univ Sci & Technol Beijing, Sch Comp & Commun Engn, Beijing, Peoples R China

[2] Meituan, Beijing, Peoples R China

Source:

INTELLIGENT DATA ANALYSIS | 2023 / Vol. 27 / No. 04

Funding:

National Key Research and Development Program;

Keywords:

SVM; data mining; machine learning; data skew; spark; SUPPORT VECTOR MACHINES; REGRESSION;

DOI:

10.3233/IDA-226774

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Support Vector Machine (SVM) is a machine learning method with excellent classification performance that has been widely used in fields such as data mining, text classification, and face recognition. However, once the data volume grows beyond a certain scale, the computation time becomes too long and efficiency drops. To address this issue, we propose a parallel balanced SVM algorithm based on Spark, named PB-SVM, which is an optimization of the traditional Cascade SVM algorithm. PB-SVM consists of three parts, namely Clustering Equal Division, Balancing Shuffle, and Iteration Termination, which together address the data skew of Cascade SVM and the large difference between local and global support vectors. We implement PB-SVM on an AliCloud Spark distributed cluster with five public datasets. Our experimental results show that in the two-class test on the covtype dataset, compared with MLlib-SVM and Cascade SVM on Spark, PB-SVM improves efficiency by 38.9% and 75.4% and accuracy by 7.16% and 8.38%, respectively. Moreover, in the multi-class test on covtype, compared with Cascade SVM on Spark, PB-SVM improves efficiency and accuracy by 94.8% and 18.26%, respectively.
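
The abstract names data skew across partitions as a problem of Cascade SVM but does not describe how Clustering Equal Division or Balancing Shuffle work. The sketch below is therefore not the PB-SVM method. It is only a generic illustration, in plain Python, of what an evenly sized, class-balanced split of training samples can look like. The function names `balanced_partitions` and `skew`, and the toy label list, are assumptions made for this example.

```python
from collections import Counter, defaultdict


def balanced_partitions(labels, num_partitions):
    """Assign sample indices to partitions round-robin, class by class.

    Illustrative only (not the paper's algorithm): partition sizes differ
    by at most one, and each class is spread across partitions with
    per-partition counts that differ by at most one.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    partitions = [[] for _ in range(num_partitions)]
    slot = 0  # running counter shared by all classes keeps sizes even
    for label in sorted(by_class):
        for idx in by_class[label]:
            partitions[slot % num_partitions].append(idx)
            slot += 1
    return partitions


def skew(partitions):
    """Difference between the largest and smallest partition size."""
    sizes = [len(p) for p in partitions]
    return max(sizes) - min(sizes)


# Toy data (hypothetical): 10 samples of class 0 and 6 of class 1.
labels = [0] * 10 + [1] * 6
parts = balanced_partitions(labels, 4)
for i, part in enumerate(parts):
    print(i, len(part), dict(Counter(labels[j] for j in part)))
print("skew:", skew(parts))
```

In a cascade-style setup, each such partition would be handed to one local SVM trainer, so an even split keeps any single worker from becoming the straggler.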


Pages: 1065-1086

Number of pages: 22
