An Improved Random Forest Classifier for Text Categorization

Cited by: 102
Authors
Xu, Baoxun [1]
Guo, Xiufeng [2]
Ye, Yunming [1]
Cheng, Jiefeng [3]
Affiliations
[1] Harbin Inst Technol, Shenzhen Grad Sch, Shenzhen 518055, Peoples R China
[2] Henan Business Coll, Dept Comp Sci, Zhengzhou 450045, Peoples R China
[3] Chinese Acad Sci, Shenzhen Inst Adv Technol, Shenzhen 518055, Peoples R China
Keywords
random forest; text categorization; random subspace; decision tree
DOI
10.4304/jcp.7.12.2913-2920
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
This paper proposes an improved random forest algorithm for classifying text data. The algorithm is designed for very high-dimensional data with multiple classes, of which text corpora are a well-known example. A novel feature weighting method and a tree selection method are developed and used together to make the random forest framework well suited to categorizing text documents with dozens of topics. With the new feature weighting method for subspace sampling and the tree selection method, we can effectively reduce the subspace size and improve classification performance without increasing the error bound. We apply the proposed method to six text data sets with diverse characteristics. The results demonstrate that the improved random forest outperforms popular text classification methods in terms of classification performance.
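The abstract names two ingredients, weighted subspace sampling and tree selection, without giving their formulas. The sketch below illustrates the general idea only and should not be read as the paper's method. It assumes a chi-square-style relevance score as the feature weight and out-of-bag accuracy as the tree selection criterion. The function names and the `keep_fraction` parameter are hypothetical.

```python
import numpy as np


def chi2_feature_weights(X, y):
    """Chi-square-style relevance score per feature, normalized to sum to 1.

    Assumption: the abstract does not state its weighting formula; chi-square
    is used here purely as an illustrative stand-in.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    # observed[c, j]: total count of term j over documents of class c
    observed = np.vstack([X[y == c].sum(axis=0) for c in classes])
    class_prob = np.array([(y == c).mean() for c in classes])
    # expected[c, j]: count of term j if it were independent of the class
    expected = np.outer(class_prob, X.sum(axis=0))
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(expected > 0, (observed - expected) ** 2 / expected, 0.0)
    scores = terms.sum(axis=0)
    total = scores.sum()
    if total == 0:
        # No feature is informative: fall back to uniform sampling.
        return np.full(X.shape[1], 1.0 / X.shape[1])
    return scores / total


def sample_subspace(weights, size, rng):
    """Draw distinct feature indices with probability proportional to weight."""
    # Sampling without replacement cannot return more features than have
    # non-zero weight, so the requested size is capped.
    size = min(size, int(np.count_nonzero(weights)))
    return rng.choice(len(weights), size=size, replace=False, p=weights)


def select_trees(oob_accuracy, keep_fraction=0.7):
    """Indices of the best trees, ranked by out-of-bag accuracy (assumed criterion)."""
    order = np.argsort(oob_accuracy)[::-1]
    n_keep = max(1, int(round(keep_fraction * len(order))))
    return order[:n_keep]


# Toy term-count matrix: 4 documents, 3 terms, 2 classes.
X = np.array([[3, 0, 1], [2, 0, 1], [0, 4, 1], [0, 3, 1]])
y = np.array([0, 0, 1, 1])
weights = chi2_feature_weights(X, y)
subspace = sample_subspace(weights, size=2, rng=np.random.default_rng(0))
kept = select_trees(np.array([0.6, 0.9, 0.7, 0.8]), keep_fraction=0.5)
```

In the toy data the third term occurs equally often in both classes, so it gets zero weight and is never sampled. Biasing the draw toward informative terms in this way is what would allow a smaller subspace than uniform sampling needs.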
Pages: 2913-2920
Number of pages: 8