Helmholtz principle based supervised and unsupervised feature selection methods for text mining

Cited by: 31
Authors
Tutkan, Melike [1]
Ganiz, Murat Can [2]
Akyokus, Selim [1]
Affiliations
[1] Dogus Univ, Dept Comp Engn, Istanbul, Turkey
[2] Marmara Univ, Dept Comp Engn, Istanbul, Turkey
Keywords
Feature selection; Attribute selection; Machine learning; Text mining; Text classification; Helmholtz principle; SEMANTIC SMOOTHING METHOD; ALGORITHM
DOI
10.1016/j.ipm.2016.03.007
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
One of the important problems in text classification is the high dimensionality of the feature space. Feature selection methods reduce this dimensionality by selecting the features most valuable for classification. Beyond reducing dimensionality, feature selection methods have the potential to improve text classifiers' performance in terms of both accuracy and time. Furthermore, they help to build simpler and, as a result, more comprehensible models. In this study we propose new methods for feature selection from textual data, called Meaning Based Feature Selection (MBFS), which are based on the Helmholtz principle from the Gestalt theory of human perception, a principle that is also used in image processing. The proposed approaches are extensively evaluated by their effect on the classification performance of two well-known classifiers on several datasets and compared with several feature selection algorithms commonly used in text mining. Our results demonstrate the value of the MBFS methods in terms of classification accuracy and execution time. (C) 2016 Elsevier Ltd. All rights reserved.
Pages: 885-910
Number of pages: 26