Discriminative features for text document classification

Author
K. Torkkola
Affiliation
[1] Motorola Labs,
Keywords
Dimension reduction; Linear discriminant analysis; Random transforms; Text classification;
DOI
Not available
Abstract
The bag-of-words approach to text document representation typically results in vectors of the order of 5000–20,000 components as the representation of documents. To make effective use of various statistical classifiers, it may be necessary to reduce the dimensionality of this representation. We point out deficiencies in class discrimination of two popular such methods, Latent Semantic Indexing (LSI), and sequential feature selection according to some relevant criterion. As a remedy, we suggest feature transforms based on Linear Discriminant Analysis (LDA). Since LDA requires operating both with large and dense matrices, we propose an efficient intermediate dimension reduction step using either a random transform or LSI. We report good classification results with the combined feature transform on a subset of the Reuters-21578 database. Drastic reduction of the feature vector dimensionality from 5000 to 12 actually improves the classification performance.
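The two-stage idea in the abstract (a cheap intermediate reduction, then a discriminant transform) can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a Gaussian random transform for the intermediate step, a two-class Fisher discriminant in place of multi-class LDA, a small ridge term for numerical stability, and synthetic toy count vectors instead of Reuters-21578. All function names, sizes, and parameters are hypothetical.

```python
import random

def random_projection_matrix(d, k, rng):
    """d x k matrix of scaled Gaussian entries (assumed form of the random transform)."""
    scale = k ** -0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]

def project(x, R):
    """Map a d-dimensional count vector to k dimensions, skipping zero counts."""
    z = [0.0] * len(R[0])
    for i, xi in enumerate(x):
        if xi:
            for j, rij in enumerate(R[i]):
                z[j] += xi * rij
    return z

def mean(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def scatter(rows, mu):
    """Sum of outer products of deviations from the class mean."""
    k = len(mu)
    S = [[0.0] * k for _ in range(k)]
    for r in rows:
        dev = [a - b for a, b in zip(r, mu)]
        for i in range(k):
            for j in range(k):
                S[i][j] += dev[i] * dev[j]
    return S

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fisher_direction(X0, X1, ridge=1e-3):
    """Two-class discriminant direction w = (Sw + ridge*I)^-1 (m1 - m0)."""
    m0, m1 = mean(X0), mean(X1)
    S0, S1 = scatter(X0, m0), scatter(X1, m1)
    k = len(m0)
    Sw = [[S0[i][j] + S1[i][j] + (ridge if i == j else 0.0) for j in range(k)]
          for i in range(k)]
    return solve(Sw, [a - b for a, b in zip(m1, m0)])

def toy_docs(n, d, active_terms, rng):
    """Synthetic bag-of-words vectors: 20 term occurrences drawn from active_terms."""
    docs = []
    for _ in range(n):
        x = [0.0] * d
        for _ in range(20):
            x[rng.choice(active_terms)] += 1.0
        docs.append(x)
    return docs

rng = random.Random(0)
d, k = 200, 8                                  # toy sizes, far below 5000 terms
docs0 = toy_docs(30, d, range(0, 120), rng)    # class 0 vocabulary
docs1 = toy_docs(30, d, range(80, 200), rng)   # class 1 vocabulary, partly shared
R = random_projection_matrix(d, k, rng)
P0 = [project(x, R) for x in docs0]
P1 = [project(x, R) for x in docs1]
w = fisher_direction(P0, P1)
z0 = [sum(a * b for a, b in zip(p, w)) for p in P0]
z1 = [sum(a * b for a, b in zip(p, w)) for p in P1]
print(sum(z0) / len(z0), sum(z1) / len(z1))    # class means of the 1-D feature
```

With more than two classes, the single direction `w` would be replaced by several discriminant directions (at most one fewer than the number of classes), which is outside the scope of this sketch.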
Pages: 301-308
Page count: 8
Related papers
  • [1] Discriminative features for text document classification
    Torkkola, K
    PATTERN ANALYSIS AND APPLICATIONS, 2003, 6 (04) : 301 - 308