Discriminative features for text document classification

Cited by: 0

Authors:

K. Torkkola

Affiliation:

[1] Motorola Labs

Source:

Formal Pattern Analysis & Applications | 2004 / Vol. 6

Keywords:

Dimension reduction; Linear discriminant analysis; Random transforms; Text classification

DOI:

Not available

Chinese Library Classification (CLC) number:

Subject classification number:

Abstract:

The bag-of-words approach to text document representation typically results in vectors of the order of 5000–20,000 components as the representation of documents. To make effective use of various statistical classifiers, it may be necessary to reduce the dimensionality of this representation. We point out deficiencies in class discrimination of two popular such methods, Latent Semantic Indexing (LSI), and sequential feature selection according to some relevant criterion. As a remedy, we suggest feature transforms based on Linear Discriminant Analysis (LDA). Since LDA requires operating both with large and dense matrices, we propose an efficient intermediate dimension reduction step using either a random transform or LSI. We report good classification results with the combined feature transform on a subset of the Reuters-21578 database. Drastic reduction of the feature vector dimensionality from 5000 to 12 actually improves the classification performance.
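The two-stage transform described in the abstract (an intermediate random projection, then LDA) can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration and not the author's implementation: the toy Poisson "corpus" stands in for Reuters-21578, the sizes (2000 terms, 50 intermediate dimensions, 3 classes) are arbitrary, the helper names are hypothetical, the scatter-matrix regularization is added for numerical stability, and the nearest-centroid classifier is only a simple stand-in for whichever classifier is applied to the reduced features.

```python
import numpy as np

def make_toy_corpus(rng, n_per_class=100, n_terms=2000, n_classes=3, n_topic_terms=50):
    """Hypothetical bag-of-words data: each class has its own block of frequent terms."""
    rates = np.full((n_classes, n_terms), 0.1)
    for c in range(n_classes):
        rates[c, c * n_topic_terms:(c + 1) * n_topic_terms] = 2.0
    y = np.repeat(np.arange(n_classes), n_per_class)
    X = rng.poisson(rates[y]).astype(float)  # term counts, one row per document
    return X, y

def random_projection(X, k, rng):
    """Project rows of X to k dimensions with a Gaussian random matrix."""
    R = rng.standard_normal((X.shape[1], k)) / np.sqrt(k)
    return X @ R, R

def lda_transform(Z, y, reg=1e-3):
    """Learn an LDA projection (k x (n_classes - 1)) from features Z and labels y."""
    classes = np.unique(y)
    k = Z.shape[1]
    mean_all = Z.mean(axis=0)
    Sw = np.zeros((k, k))  # within-class scatter
    Sb = np.zeros((k, k))  # between-class scatter
    for c in classes:
        Zc = Z[y == c]
        mc = Zc.mean(axis=0)
        D = Zc - mc
        Sw += D.T @ D
        d = (mc - mean_all)[:, None]
        Sb += len(Zc) * (d @ d.T)
    # Assumption: a small ridge term keeps Sw well conditioned.
    Sw += reg * np.trace(Sw) / k * np.eye(k)
    # Generalized eigenproblem Sb w = lambda Sw w, solved via Sw^{-1} Sb.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(-evals.real)[: len(classes) - 1]
    return evecs[:, order].real

def nearest_centroid_predict(F_train, y_train, F_test):
    """Assign each test row to the class with the closest training centroid."""
    classes = np.unique(y_train)
    centroids = np.stack([F_train[y_train == c].mean(axis=0) for c in classes])
    dist = ((F_test[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return classes[dist.argmin(axis=1)]

rng = np.random.default_rng(0)
X, y = make_toy_corpus(rng)
idx = rng.permutation(len(y))
train, test = idx[:200], idx[200:]

# Stage 1: cheap intermediate reduction (2000 -> 50) so LDA works on small dense matrices.
Z_train, R = random_projection(X[train], 50, rng)
# Stage 2: discriminative reduction (50 -> n_classes - 1).
W = lda_transform(Z_train, y[train])

F_train = Z_train @ W
F_test = (X[test] @ R) @ W
pred = nearest_centroid_predict(F_train, y[train], F_test)
accuracy = float((pred == y[test]).mean())
print(F_train.shape, accuracy)
```

The point of the intermediate step is that the scatter matrices are 50 x 50 here and not 2000 x 2000, which is the kind of saving the abstract motivates for vocabularies of 5000 or more terms. LDA yields at most one fewer dimension than the number of classes, so the final feature size is set by the class count.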


Pages: 301-308

Page count: 7
