Fast Training on Large Genomics Data using Distributed Support Vector Machines

Authors
Theera-Ampornpunt, Nawanol [1]
Kim, Seong Gon [1]
Ghoshal, Asish [1]
Bagchi, Saurabh [1]
Grama, Ananth [1]
Chaterji, Somali [1]
Affiliations
[1] Purdue Univ, W Lafayette, IN 47907 USA
Keywords
machine learning; classifier training; computational genomics; computational cost; network cost; CHIP-SEQ; TRANSCRIPTION; PREDICTION; ELEMENTS; ENHANCER; SIGNATURES
DOI
Not available
Chinese Library Classification number
TP3 [Computing technology, computer technology]
Discipline classification code
0812
Abstract
The field of genomics has seen an explosion of high-quality data, driven by tremendous strides in genomic sequencing instruments and in the computational genomics applications that make sense of the data. A common use case for genomics data is to determine whether a specific genetic signature is correlated with some disease manifestation. The Support Vector Machine (SVM) is a widely used classifier in the computational literature, and previous studies have applied SVMs successfully to this use case. However, SVMs suffer from a widely recognized scalability problem in both memory use and computation time. Whether training such classifiers can scale to the massive sizes that characterize many genomics datasets remains an open question. We answer that question here for a specific dataset, where the goal is to decipher whether a regulatory module with a particular combinatorial epigenetic "pattern" will regulate the expression of a gene. The specifics of the dataset, however, are likely of less relevance to the claims of our work. We take Cascade SVM, a previously proposed theoretical technique for efficient SVM training, build our classifier, called EP-SVM, on it, and empirically evaluate how it scales to the large genomics dataset. We implement Cascade SVM on the Apache Spark platform and open-source this implementation. Our evaluation brings out the computational cost incurred by each application process, the way the overall workload is distributed among multiple processes, which can potentially execute on different cores or different machines, and the cost of transferring data between cores or machines. We believe we are the first to shed light on the computational and network costs of training an SVM on a multi-dimensional genomics dataset. We also evaluate the accuracy of the resulting classifier as a function of the parameters of the SVM model.
Pages: 8