Fast Training on Large Genomics Data using Distributed Support Vector Machines

Cited by: 0
Authors
Theera-Ampornpunt, Nawanol [1 ]
Kim, Seong Gon [1 ]
Ghoshal, Asish [1 ]
Bagchi, Saurabh [1 ]
Grama, Ananth [1 ]
Chaterji, Somali [1 ]
Affiliations
[1] Purdue Univ, W Lafayette, IN 47907 USA
Keywords
machine learning; classifier training; computational genomics; computational cost; network cost; CHIP-SEQ; TRANSCRIPTION; PREDICTION; ELEMENTS; ENHANCER; SIGNATURES;
DOI
Not available
CLC number (Chinese Library Classification)
TP3 [Computing technology, computer technology]
Subject classification code
0812
Abstract
The field of genomics has seen an explosion of high-quality data, with tremendous strides made in genomic sequencing instruments and in the computational genomics applications meant to make sense of the data. A common use case for genomics data is to answer the question of whether a specific genetic signature is correlated with some disease manifestation. The Support Vector Machine (SVM) is a widely used classifier in the computational literature, and previous studies have successfully applied SVMs to this use case on genomics data. However, SVMs suffer from a widely recognized scalability problem in both memory use and computational time. It is as yet an unanswered question whether training such classifiers can scale to the massive sizes that characterize many genomics data sets. We answer that question here for a specific dataset, used to decipher whether a regulatory module with a particular combinatorial epigenetic "pattern" will regulate the expression of a gene. However, the specifics of the dataset are of less relevance to the claims of our work. We take a proposed technique for efficient SVM training, namely Cascade SVM, use it to create our classifier, called EP-SVM, and empirically evaluate how it scales to the large genomics dataset. We implement Cascade SVM on the Apache Spark platform and open-source this implementation(1). Through our evaluation, we bring out the computational cost on each application process, the way the overall workload is distributed among multiple processes, which can execute on different cores or different machines, and the cost of transferring data to those cores or machines. We believe we are the first to shed light on the computational and network costs of training an SVM on a multi-dimensional genomics dataset. We also evaluate the accuracy of the classifier as a function of the parameters of the SVM model.
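The cascade scheme summarized above maps naturally onto Spark's partitioned execution model. The sketch below is a minimal, hypothetical illustration of a single cascade round, not the authors' open-sourced EP-SVM implementation: local SVMs are trained on each data partition in parallel, only their support vectors are returned, and a final SVM is retrained on the merged support vectors. The use of scikit-learn's SVC, the RBF kernel, the feature/label layout, and all parameter values are assumptions made for illustration.

```python
# Minimal sketch of one Cascade SVM round on Apache Spark (PySpark + scikit-learn).
# Hypothetical data layout: each element is [feature_1, ..., feature_d, label].
import numpy as np
from pyspark.sql import SparkSession
from sklearn.svm import SVC

def train_partition(rows):
    """Train an SVM on one partition and emit only its support vectors."""
    data = np.array(list(rows))
    if data.size == 0:
        return iter([])
    if len(np.unique(data[:, -1])) < 2:
        # Degenerate partition (single class): pass all rows to the next layer.
        return iter(data)
    X, y = data[:, :-1], data[:, -1]
    clf = SVC(kernel="rbf", C=1.0).fit(X, y)
    # Support vectors (with their labels) summarize the partition for the next layer.
    return iter(np.column_stack([clf.support_vectors_, y[clf.support_]]))

spark = SparkSession.builder.appName("cascade-svm-sketch").getOrCreate()

# Synthetic stand-in for the genomics feature matrix: 8 features + 1 binary label.
raw = np.random.rand(10_000, 9).tolist()
rdd = spark.sparkContext.parallelize(raw, numSlices=8) \
           .map(lambda r: r[:-1] + [1.0 if r[-1] > 0.5 else 0.0])

# Layer 1: per-partition SVMs; collect the surviving support vectors on the driver.
merged = np.array(rdd.mapPartitions(train_partition).collect())

# Final layer: retrain on the (much smaller) union of support vectors.
final_model = SVC(kernel="rbf", C=1.0).fit(merged[:, :-1], merged[:, -1])
spark.stop()
```

In a full cascade, the map-and-merge step would be repeated over several layers, feeding the merged support vectors back as input, until the support-vector set stabilizes; the sketch shows only the first reduction and the final retraining.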
Pages: 8