Fast Training on Large Genomics Data using Distributed Support Vector Machines

Cited by: 0

Authors:

Theera-Ampornpunt, Nawanol [1]

Kim, Seong Gon [1]

Ghoshal, Asish [1]

Bagchi, Saurabh [1]

Grama, Ananth [1]

Chaterji, Somali [1]

Affiliations:

[1] Purdue Univ, W Lafayette, IN 47907 USA

Source:

2016 8TH INTERNATIONAL CONFERENCE ON COMMUNICATION SYSTEMS AND NETWORKS (COMSNETS) | 2016

Keywords:

machine learning; classifier training; computational genomics; computational cost; network cost; CHIP-SEQ; TRANSCRIPTION; PREDICTION; ELEMENTS; ENHANCER; SIGNATURES;

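DOI:

Not available

Chinese Library Classification (CLC) number:

TP3 [Computing technology, computer technology];

Subject classification code:

0812;

Abstract: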

The field of genomics has seen an explosion of high-quality data, driven by major advances in genomic sequencing instruments and in the computational genomics applications that make sense of the data. A common use case for genomics data is to determine whether a specific genetic signature is correlated with certain disease manifestations. The Support Vector Machine (SVM) is a widely used classifier in the computational literature, and previous studies have successfully applied SVMs to this use case. However, SVMs suffer from a widely recognized scalability problem in both memory use and computation time. It remains an open question whether training such classifiers can scale to the massive sizes that characterize many genomics datasets. We answer that question here for a specific dataset, with the goal of deciphering whether a regulatory module of a particular combinatorial epigenetic "pattern" will regulate the expression of a gene. The specifics of the dataset, however, are likely of less relevance to the claims of our work. We take a previously proposed theoretical technique for efficient SVM training, namely Cascade SVM, build a classifier called EP-SVM, and empirically evaluate how it scales to the large genomics dataset. We implement Cascade SVM on the Apache Spark platform and open-source this implementation. Through our evaluation, we bring out the computational cost incurred by each application process, the way the overall workload is distributed among multiple processes, which can potentially execute on different cores or different machines, and the cost of transferring data to different cores or machines. We believe we are the first to shed light on the computational and network costs of training an SVM on a multi-dimensional genomics dataset. We also evaluate the accuracy of the classifier as a function of the parameters of the SVM model.

Pages: 8

50 records in total

[32] Fast Training of Support Vector Machines Using Error-Center-Based Optimization
Meng, L.
Wu, Q. H.
INTERNATIONAL JOURNAL OF AUTOMATION AND COMPUTING, 2005, 2 (01) : 6 - 12
[34] Fast and Accurate Support Vector Machines on Large Scale Systems
Vishnu, Abhinav
Narasimhan, Jeyanthi
Holder, Lawrence
Kerbyson, Darren
Hoisie, Adolfy
2015 IEEE INTERNATIONAL CONFERENCE ON CLUSTER COMPUTING - CLUSTER 2015, 2015, : 110 - 119
[35] A fast iterative single data approach to training unconstrained least squares support vector machines
Li, Bing
Song, Shiji
Li, Kang
NEUROCOMPUTING, 2013, 115 : 31 - 38
[36] Effective training of support vector machines using extractive support vector algorithm
Yao, Chih-Chia
Yu, Pao-Ta
PROCEEDINGS OF 2007 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7, 2007, : 1808 - +
[37] Support vector machines with fuzzy entropy for training with a large datasets
Wu, ZD
Gao, XB
Xie, WX
Yu, JP
2004 7TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING PROCEEDINGS, VOLS 1-3, 2004, : 1439 - 1442
[38] Support vector machines with clustering for training with very large datasets
Evgeniou, T
Pontil, M
METHODS AND APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2002, 2308 : 346 - 354
[39] TRAINING PAIRWISE SUPPORT VECTOR MACHINES WITH LARGE SCALE DATASETS
Cumani, Sandro
Laface, Pietro
2014 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2014,
[40] Distributed data mining model based on Support Vector Machines
Ju, Chun-Hua
Guo, Fei-Peng
Xitong Gongcheng Lilun yu Shijian/System Engineering Theory and Practice, 2010, 30 (10) : 1855 - 1863
