Protein Sequence Classification Using Feature Hashing

Cited by: 16
Authors
Caragea, Cornelia [1]
Silvescu, Adrian [2]
Mitra, Prasenjit [1]
Affiliations
[1] Penn State Univ, Informat Sci & Technol, University Pk, PA 16802 USA
[2] Naviance Inc, Oakland, CA USA
Funding
National Science Foundation (USA)
Keywords
feature hashing; variable length k-grams; dimensionality reduction; PREDICTION;
DOI
10.1109/BIBM.2011.91
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Recent advances in next-generation sequencing technologies have resulted in an exponential increase in protein sequence data. The k-gram representation, used for protein sequence classification, usually results in prohibitively high-dimensional input spaces for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. We study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by mapping features to hash keys, such that multiple features can be mapped (at random) to the same key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" and feature selection approaches. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
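The hashing step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `kgrams` and `hashed_features`, the choice of `zlib.crc32` as the hash function, the bin count, and the reading of "variable length k-grams" as all substrings of length 1 to `k_max` are assumptions made only for illustration.

```python
import zlib


def kgrams(seq, k_max):
    """Yield every substring of length 1..k_max (assumed meaning of
    'variable length k-grams'; the abstract does not define it)."""
    for k in range(1, k_max + 1):
        for i in range(len(seq) - k + 1):
            yield seq[i:i + k]


def hashed_features(seq, k_max=3, n_bins=16):
    """Map each k-gram to one of n_bins hash keys and aggregate counts.

    Different k-grams may collide on the same key; their counts are
    simply added together, which is what reduces the dimensionality.
    crc32 is a stand-in hash chosen because it is deterministic
    across runs (unlike Python's built-in hash() for strings).
    """
    vec = [0] * n_bins
    for gram in kgrams(seq, k_max):
        idx = zlib.crc32(gram.encode("ascii")) % n_bins
        vec[idx] += 1
    return vec


if __name__ == "__main__":
    # Hypothetical toy sequence, not taken from the paper's data sets.
    features = hashed_features("MKVLAAGIV", k_max=3, n_bins=16)
    print(len(features), sum(features))
```

Whatever the number of distinct k-grams, the output vector always has `n_bins` entries, and the total count is preserved because collisions add counts rather than discard them.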
Pages: 538-543
Number of pages: 6
Related papers
50 in total
  • [1] Protein sequence classification using feature hashing
    Cornelia Caragea
    Adrian Silvescu
    Prasenjit Mitra
    Proteome Science, 10
  • [2] Protein sequence classification using feature hashing
    Caragea, Cornelia
    Silvescu, Adrian
    Mitra, Prasenjit
    PROTEOME SCIENCE, 2012, 10
  • [3] Effect of Feature Hashing on Fair Classification
    Dutta, Ritik
    Gohil, Varun
    Jain, Atishay
    PROCEEDINGS OF THE 7TH ACM IKDD CODS AND 25TH COMAD (CODS-COMAD 2020), 2020, : 365 - 366
  • [4] Efficient Feature Selection and Classification of Protein Sequence Data in Bioinformatics
    Iqbal, Muhammad Javed
    Faye, Ibrahima
    Samir, Brahim Belhaouari
    Said, Abas Md
    SCIENTIFIC WORLD JOURNAL, 2014,
  • [5] Classification of enzyme function from protein sequence based on feature representation
    Lee, Bum Ju
    Lee, Jong Yun
    Lee, Heon Gu
    Ryu, Keun Ho
    PROCEEDINGS OF THE 7TH IEEE INTERNATIONAL SYMPOSIUM ON BIOINFORMATICS AND BIOENGINEERING, VOLS I AND II, 2007, : 741 - +
  • [6] A Novel Technique of Feature Selection with ReliefF and CFS for Protein Sequence Classification
    Kaur, Kiranpreet
    Patil, Nagamma
    RECENT FINDINGS IN INTELLIGENT COMPUTING TECHNIQUES, VOL 1, 2019, 707 : 399 - 405
  • [7] Community Detection-Based Feature Construction for Protein Sequence Classification
    Tangirala, Karthik
    Herndon, Nic
    Caragea, Doina
    BIOINFORMATICS RESEARCH AND APPLICATIONS (ISBRA 2015), 2015, 9096 : 331 - 342
  • [8] novel feature selection based on apriori property and correlation analysis for protein sequence classification using MapReduce
    Bhavani, R.
    Sadasivam, G. Sudha
    INTERNATIONAL JOURNAL OF DATA MINING AND BIOINFORMATICS, 2017, 17 (03) : 255 - 265
  • [9] Hashing Based Hierarchical Feature Representation for Hyperspectral Imagery Classification
    Pan, Bin
    Shi, Zhenwei
    Xu, Xia
    Yang, Yi
    REMOTE SENSING, 2017, 9 (11)
  • [10] Feature and Semantic Views Consensus Hashing for Image Set Classification
    Sun, Yuan
    Peng, Dezhong
    Huang, Haixiao
    Ren, Zhenwen
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 2097 - 2105