The Influence of Feature Representation of Text on the Performance of Document Classification

Cited by: 18
Authors
Martincic-Ipsic, Sanda [1]
Milicic, Tanja [1]
Todorovski, Ljupco [2, 3]
Affiliations
[1] Univ Rijeka, Dept Informat, Radmile Matejcic 2, Rijeka 51000, Croatia
[2] Univ Ljubljana, Fac Publ Adm, Gosarjeva Ulica 5, Ljubljana 1000, Slovenia
[3] Jozef Stefan Inst, Dept Knowledge Technol, Jamova 39, Ljubljana 1000, Slovenia
Source
APPLIED SCIENCES-BASEL | 2019 / Vol. 9 / Issue 04
Keywords
document classification; bag-of-words; word2vec; doc2vec; graph-of-words; complex networks; KEYWORD EXTRACTION METHODS; LANGUAGE; CLASSIFIERS; FRAMEWORK; ALGORITHM; NETWORK;
DOI
10.3390/app9040743
Chinese Library Classification
O6 [Chemistry];
Subject classification code
0703;
Abstract
In this paper, we perform a comparative analysis of three models for the feature representation of text documents in the context of document classification. In particular, we consider the most often used family of bag-of-words models, the recently proposed continuous-space models word2vec and doc2vec, and the model based on the representation of text documents as language networks. While the bag-of-words models have been extensively used for the document classification task, the performance of the other two models on the same task has not been well understood. This is especially true for the network-based models, which have rarely been considered for the representation of text documents for classification. In this study, we measure the performance of document classifiers trained with the random forest method on features generated with the three models and their variants. Multi-objective rankings are proposed as the framework for multi-criteria comparative analysis of the results. The results of the empirical comparison show that the commonly used bag-of-words model has a performance comparable to the one obtained by the emerging continuous-space model doc2vec. In particular, the low-dimensional variants of doc2vec generating up to 75 features are among the top-performing document representation models. The results also indicate that doc2vec shows superior performance when classifying large documents.
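As a rough illustration of the bag-of-words representation mentioned in the abstract, the following minimal sketch turns a list of documents into term-count feature vectors using only the Python standard library. The tokenizer, the function names `tokenize` and `bag_of_words`, and the toy corpus are assumptions made for this example; they are not taken from the paper, which does not describe its preprocessing here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of ASCII letters (a deliberately simple tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(documents):
    """Return (vocabulary, matrix); matrix[i][j] counts vocabulary[j] in documents[i]."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    # The vocabulary is the sorted set of all terms seen in any document.
    vocabulary = sorted(set().union(*counts)) if counts else []
    matrix = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, matrix


# Toy corpus, invented for illustration only.
docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab, X = bag_of_words(docs)
```

Each row of `X` could then serve as the feature vector of one document for a classifier such as a random forest; the classifier itself is left out of this sketch.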
Pages: 27
Related papers
50 in total
  • [1] USING CONCEPTUAL DOCUMENT REPRESENTATION FOR MULTILINGUAL TEXT CLASSIFICATION
    Borges Garcia, A.
    Castro Castro, D.
    Ortega-Bueno, R.
    [J]. HOLOS, 2018, 34 (02) : 386 - 396
  • [2] Weighted Document Frequency for Feature Selection in Text Classification
    Li, Baoli
    Yan, Qiuling
    Xu, Zhenqiang
    Wang, Guicai
    [J]. PROCEEDINGS OF 2015 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING, 2015, : 132 - 135
  • [3] Investigating Optimal Feature Selection Method to Improve the Performance of Amharic Text Document Classification
    Alemu, Tamir Anteneh
    Tegegnie, Alemu Kumilachew
    [J]. AFRICAN JOURNAL OF LIBRARY ARCHIVES AND INFORMATION SCIENCE, 2019, 29 (02) : 103 - 113
  • [4] Interactions between document representation and feature selection in text categorization
    Radovanovic, Milos
    Ivanovic, Mirjana
    [J]. DATABASE AND EXPERT SYSTEMS APPLICATIONS, PROCEEDINGS, 2006, 4080 : 489 - 498
  • [5] Text Classification via iVector Based Feature Representation
    Zha, Shengxin
    Peng, Xujun
    Cao, Huaigu
    Zhuang, Xiaodan
    Natarajan, Pradeep
    Natarajan, Prem
    [J]. 2014 11TH IAPR INTERNATIONAL WORKSHOP ON DOCUMENT ANALYSIS SYSTEMS (DAS 2014), 2014, : 151 - 155
  • [6] Bag-of-Concepts Document Representation for Bayesian Text Classification
    Mourino-Garcia, Marcos
    Perez-Rodriguez, Roberto
    Anido-Rifon, Luis
    Gomez-Carballa, Miguel
    [J]. 2016 IEEE INTERNATIONAL CONFERENCE ON COMPUTER AND INFORMATION TECHNOLOGY (CIT), 2016, : 281 - 288
  • [7] Improved Document Feature Selection with Categorical Parameter for Text Classification
    Wang, Fen
    Li, Xiaoxuan
    Huang, Xiaotao
    Kang, Ling
    [J]. MOBILE, SECURE, AND PROGRAMMABLE NETWORKING (MSPN 2016), 2016, 10026 : 86 - 98
  • [8] Cluster Based Symbolic Representation and Feature Selection for Text Classification
    Harish, B. S.
    Guru, D. S.
    Manjunath, S.
    Dinesh, R.
    [J]. ADVANCED DATA MINING AND APPLICATIONS (ADMA 2010), PT II, 2010, 6441 : 158 - 166
  • [9] Knowledge transfer based on feature representation mapping for text classification
    Meng, Jiana
    Lin, Hongfei
    Li, Yanpeng
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2011, 38 (08) : 10562 - 10567
  • [10] Short Text Classification using Wikipedia Concept based Document Representation
    Wang, Xiang
    Chen, Ruhua
    Jia, Yan
    Zhou, Bin
    [J]. 2013 INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY AND APPLICATIONS (ITA), 2013, : 471 - 474