Protein Sequence Classification Using Feature Hashing

Cited by: 16
Authors
Caragea, Cornelia [1]
Silvescu, Adrian [2]
Mitra, Prasenjit [1]
Affiliations
[1] Penn State Univ, Informat Sci & Technol, University Pk, PA 16802 USA
[2] Naviance Inc, Oakland, CA USA
Funding
National Science Foundation (USA)
Keywords
feature hashing; variable length k-grams; dimensionality reduction; PREDICTION;
DOI
10.1109/BIBM.2011.91
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Recent advances in next-generation sequencing technologies have resulted in an exponential increase in protein sequence data. The k-gram representation used for protein sequence classification usually results in prohibitively high-dimensional input spaces for large values of k. Applying data mining algorithms to these input spaces may be intractable because of the large number of dimensions. Hence, dimensionality reduction techniques can be crucial for both the performance and the complexity of the learning algorithms. We study the applicability of feature hashing to protein sequence classification, where the original high-dimensional space is "reduced" by mapping features to hash keys, such that multiple features can be mapped (at random) to the same key, and by "aggregating" their counts. We compare feature hashing with the "bag of k-grams" and feature selection approaches. Our results show that feature hashing is an effective approach to reducing dimensionality on protein sequence classification tasks.
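The hashing step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the MD5-based hash, the choice of k from 1 to `max_k`, the bucket count, and the toy sequence are all assumptions made for illustration. Every k-gram is mapped to one of `num_buckets` hash keys, and the counts of k-grams that collide on the same key are summed.

```python
import hashlib


def kgrams(sequence, k):
    """Return all contiguous substrings of length k (the k-grams)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def bucket(feature, num_buckets):
    """Map a feature string to a hash key in [0, num_buckets).

    MD5 is an assumed choice; it keeps the mapping stable across runs,
    unlike Python's built-in hash(), which is salted per process.
    """
    digest = hashlib.md5(feature.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def hashed_kgram_counts(sequence, max_k, num_buckets):
    """Build a fixed-length count vector over hashed k-grams, k = 1..max_k.

    Distinct k-grams may collide on the same key; their counts are
    aggregated by addition in that position.
    """
    vector = [0] * num_buckets
    for k in range(1, max_k + 1):
        for gram in kgrams(sequence, k):
            vector[bucket(gram, num_buckets)] += 1
    return vector


seq = "MKVLAAGIVGL"  # hypothetical toy sequence, not from the paper
vec = hashed_kgram_counts(seq, max_k=3, num_buckets=16)
print(len(vec), sum(vec))
```

The vector length is fixed by `num_buckets` no matter how many distinct k-grams occur, which is where the dimensionality reduction comes from. Fewer buckets mean more collisions and a coarser representation.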
Pages: 538-543
Number of pages: 6