Protein Sequence Classification Using Feature Hashing

Cited by: 16
Authors
Caragea, Cornelia [1]
Silvescu, Adrian [2]
Mitra, Prasenjit [1]
Affiliations
[1] Penn State Univ, Informat Sci & Technol, University Pk, PA 16802 USA
[2] Naviance Inc, Oakland, CA USA
Funding
National Science Foundation (USA);
Keywords
feature hashing; variable length k-grams; dimensionality reduction; PREDICTION;
DOI
10.1109/BIBM.2011.91
Chinese Library Classification
TP39 [Computer applications];
Subject classification codes
081203; 0835;
Abstract
Recent advances in next-generation sequencing technologies have resulted in an exponential increase in protein sequence data. The k-gram representation, used for protein sequence classification, usually results in prohibitively high-dimensional input spaces for large values of k. Applying data mining algorithms to these input spaces may be intractable due to the large number of dimensions. Hence, using dimensionality reduction techniques can be crucial for the performance and the complexity of the learning algorithms. We study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by mapping features to hash keys, such that multiple features can be mapped (at random) to the same key, and "aggregating" their counts. We compare feature hashing with the "bag of k-grams" and feature selection approaches. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
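The hashing step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the `max_k` and `n_bins` parameters, the example sequence, and the use of CRC32 as the hash function are all assumptions chosen for illustration. The sketch enumerates variable-length k-grams of a sequence, maps each k-gram to one of `n_bins` hash keys, and aggregates the counts of k-grams that land on the same key.

```python
import zlib
from collections import Counter


def kgrams(sequence, max_k):
    """Yield every contiguous substring of length 1..max_k (variable-length k-grams)."""
    for k in range(1, max_k + 1):
        for i in range(len(sequence) - k + 1):
            yield sequence[i:i + k]


def hash_features(sequence, max_k=3, n_bins=16):
    """Return a length-n_bins vector of aggregated k-gram counts.

    Assumption: CRC32 stands in for the hash function; it is used here only
    because it is deterministic across runs, not because the paper uses it.
    """
    vector = [0] * n_bins
    for gram, count in Counter(kgrams(sequence, max_k)).items():
        # Several distinct k-grams may share an index (a collision);
        # their counts are summed, which is the "aggregating" step.
        index = zlib.crc32(gram.encode("ascii")) % n_bins
        vector[index] += count
    return vector


# Hypothetical toy sequence, not taken from the paper's data sets.
vec = hash_features("MKVLAAGIVGL", max_k=3, n_bins=16)
print(len(vec), sum(vec))
```

The output vector always has `n_bins` entries regardless of how many distinct k-grams occur, and because collisions are resolved by summing, the total count of k-grams is preserved. Choosing `n_bins` trades memory against the number of collisions.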
Pages: 538-543
Page count: 6