Discriminative features for text document classification

Author
K. Torkkola
Affiliation
[1] Motorola Labs,
Keywords
Dimension reduction; Linear discriminant analysis; Random transforms; Text classification;
DOI
Not available
Abstract
The bag-of-words approach to text document representation typically yields document vectors with on the order of 5000 to 20,000 components. To make effective use of various statistical classifiers, it may be necessary to reduce the dimensionality of this representation. We point out deficiencies in the class discrimination of two popular dimension reduction methods: Latent Semantic Indexing (LSI) and sequential feature selection according to some relevant criterion. As a remedy, we suggest feature transforms based on Linear Discriminant Analysis (LDA). Because LDA requires operating on matrices that are both large and dense, we propose an efficient intermediate dimension reduction step using either a random transform or LSI. We report good classification results with the combined feature transform on a subset of the Reuters-21578 database. A drastic reduction of the feature vector dimensionality from 5000 to 12 actually improves classification performance.
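The two-stage idea in the abstract (a cheap intermediate reduction followed by LDA) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the random transform is taken to be a Gaussian matrix, the data are synthetic two-class word counts rather than Reuters-21578, a small ridge term is added to keep the scatter matrix invertible, and only two classes are used, so LDA reduces to a single Fisher direction.

```python
import random

def random_projection(rows, k, seed=0):
    """Map d-dimensional rows to k dimensions with a Gaussian random matrix."""
    rng = random.Random(seed)
    d = len(rows[0])
    scale = 1.0 / k ** 0.5
    R = [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]
    return [[sum(x[i] * R[i][j] for i in range(d)) for j in range(k)] for x in rows]

def mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def scatter(rows, mu):
    """Sum of outer products of the rows centered on mu."""
    k = len(mu)
    S = [[0.0] * k for _ in range(k)]
    for x in rows:
        diff = [xi - mi for xi, mi in zip(x, mu)]
        for a in range(k):
            for b in range(k):
                S[a][b] += diff[a] * diff[b]
    return S

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def fisher_direction(X0, X1, ridge=1e-3):
    """Two-class LDA: w = Sw^{-1} (m1 - m0), with a ridge term (an assumption)."""
    m0, m1 = mean(X0), mean(X1)
    S0, S1 = scatter(X0, m0), scatter(X1, m1)
    k = len(m0)
    Sw = [[S0[a][b] + S1[a][b] + (ridge if a == b else 0.0) for b in range(k)]
          for a in range(k)]
    return solve(Sw, [b1 - a0 for a0, b1 in zip(m0, m1)])

def toy_doc(rng, label, d):
    """Hypothetical word-count vector: each class uses its own half of the vocabulary."""
    half = d // 2
    active = range(half) if label == 0 else range(half, d)
    doc = [0] * d
    for i in active:
        doc[i] = rng.randint(0, 5)
    return doc

rng = random.Random(1)
d, k, n = 60, 10, 30                       # vocabulary size, intermediate size, docs per class
labels = [0] * n + [1] * n
docs = [toy_doc(rng, y, d) for y in labels]

Z = random_projection(docs, k)             # stage 1: d -> k random transform
Z0 = [z for z, y in zip(Z, labels) if y == 0]
Z1 = [z for z, y in zip(Z, labels) if y == 1]
w = fisher_direction(Z0, Z1)               # stage 2: k -> 1 discriminant direction

threshold = (dot(w, mean(Z0)) + dot(w, mean(Z1))) / 2
predicted = [1 if dot(w, z) > threshold else 0 for z in Z]
accuracy = sum(p == y for p, y in zip(predicted, labels)) / len(labels)
print(f"training accuracy after {d} -> {k} -> 1 reduction: {accuracy:.2f}")
```

The point of the intermediate step is that the scatter matrix LDA must invert is k by k rather than d by d. With more than two classes, the single direction would be replaced by the leading eigenvectors of the within-class and between-class scatter problem, which this sketch does not attempt.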
Pages: 301-308
Number of pages: 7