Topic-based Classification through Unigram Unmasking

Cited by: 3
Authors
HaCohen-Kerner, Yaakov [1 ]
Rosenfeld, Avi [2 ]
Sabag, Asaf [1 ]
Tzidkani, Maor [1 ]
Affiliations
[1] Jerusalem Coll Technol, Dept Comp Sci, IL-9116001 Jerusalem, Israel
[2] Jerusalem Coll Technol, Dept Ind Engn, IL-9116001 Jerusalem, Israel
Keywords
Bag of words; Overfitting Features; Supervised machine learning; Textual features; Text classification; Topic-based classification; Unmasking; Word unigrams; STYLISTIC FEATURE SETS; HISTORICAL PERIOD; CLASSIFIERS; DOCUMENTS;
DOI
10.1016/j.procs.2018.07.210
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Text classification (TC) is the task of automatically assigning documents to a fixed number of categories. TC is an important component in many text applications such as text indexing, information extraction, information retrieval, text mining, and word sense disambiguation. In this paper, we present an alternative method of feature reduction, a concept we call unigram unmasking. Previous text classification approaches have typically focused on a "bag-of-words" vector. We posit that at times some of the most frequent unigrams, which have the greatest weight within these vectors, are not only unnecessary for classification, but can at times even hurt models' accuracy. We present an approach where a percentage of common unigrams are intentionally removed, thus "unmasking" the added value from less popular unigrams. We present results from a topic-based classification task (hundreds of free online textbooks belonging to five domains: Career and Study Advice, Economics and Finance, IT Programming, Natural Sciences, Statistics and Mathematics) and show that unmasking was helpful across several machine learning models, with some models even benefiting from removing nearly 50% of the most frequent unigrams from the bag-of-words vectors. (C) 2018 The Authors. Published by Elsevier Ltd.
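The unmasking idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the function name `unmask_unigrams`, the `drop_fraction` parameter, the rule of ranking unigrams by total corpus frequency, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


def unmask_unigrams(docs, drop_fraction):
    """Build bag-of-words count vectors after dropping the most frequent unigrams.

    drop_fraction is the share of the vocabulary, ranked by corpus frequency,
    to remove. This ranking rule is an assumption, not the paper's exact one.
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")

    tokenized = [tokenize(d) for d in docs]
    corpus_counts = Counter(tok for doc in tokenized for tok in doc)

    # Rank by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(corpus_counts, key=lambda w: (-corpus_counts[w], w))
    n_drop = int(len(ranked) * drop_fraction)

    # The remaining, less frequent unigrams form the "unmasked" vocabulary.
    vocab = sorted(ranked[n_drop:])
    index = {w: i for i, w in enumerate(vocab)}

    vectors = []
    for doc in tokenized:
        vec = [0] * len(vocab)
        for tok in doc:
            if tok in index:
                vec[index[tok]] += 1
        vectors.append(vec)
    return vocab, vectors


# Hypothetical toy corpus, not data from the paper.
docs = [
    "the market and the interest rate",
    "the loop and the array index",
    "the sample and the variance",
]
vocab, vectors = unmask_unigrams(docs, drop_fraction=0.25)
```

In this toy corpus the two most frequent unigrams ("the" and "and") are dropped, so each document is represented only by its topic-specific words. The resulting vectors could then be passed to any supervised classifier, with `drop_fraction` tuned on held-out data.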
Pages: 69-76
Page count: 8
Related papers
50 in total
  • [1] Topic-Based Instance and Feature Selection in Multilabel Classification
    Ma, Jianghong
    Chow, Tommy W. S.
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2022, 33 (01) : 315 - 329
  • [2] Learning topic-based mixture models for factored classification
    Chen, Qiong
    Mitchell, Tom M.
    [J]. INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE FOR MODELLING, CONTROL & AUTOMATION JOINTLY WITH INTERNATIONAL CONFERENCE ON INTELLIGENT AGENTS, WEB TECHNOLOGIES & INTERNET COMMERCE, VOL 1, PROCEEDINGS, 2006, : 25 - +
  • [3] Topic-based habitat classification using visual data
    Pizarro, Oscar
    Williams, Stefan B.
    Colquhoun, Jamie
    [J]. OCEANS 2009 - EUROPE, VOLS 1 AND 2, 2009, : 1320 - +
  • [4] Learning topic-based mixture models for factored classification
    Chen, Qiong
    Mitchell, Tom M.
    [J]. INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE FOR MODELLING, CONTROL & AUTOMATION JOINTLY WITH INTERNATIONAL CONFERENCE ON INTELLIGENT AGENTS, WEB TECHNOLOGIES & INTERNET COMMERCE, VOL 1, PROCEEDINGS, 2006, : 1114 - +
  • [5] Topic-Based Microblog Polarity Classification Based on Cascaded Model
    Liu, Quanchao
    Hu, Yue
    Lei, Yangfan
    Wei, Xiangpeng
    Liu, Guangyong
    Bi, Wei
    [J]. COMPUTATIONAL SCIENCE - ICCS 2018, PT II, 2018, 10861 : 206 - 220
  • [6] Feature selection for the topic-based mixture model in factored classification
    Chen, Qiong
    [J]. 2006 INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND SECURITY, PTS 1 AND 2, PROCEEDINGS, 2006, : 39 - 44
  • [7] Topic-based classification and identification of global trends for startup companies
    Savin, Ivan
    Chukavina, Kristina
    Pushkarev, Andrey
    [J]. SMALL BUSINESS ECONOMICS, 2023, 60 (02) : 659 - 689
  • [8] Topic-based classification and identification of global trends for startup companies
    Ivan Savin
    Kristina Chukavina
    Andrey Pushkarev
    [J]. Small Business Economics, 2023, 60 : 659 - 689
  • [9] Sentiment Analysis on Twitter through Topic-Based Lexicon Expansion
    Zhou, Zhixin
    Zhang, Xiuzhen
    Sanderson, Mark
    [J]. DATABASES THEORY AND APPLICATIONS, ADC 2014, 2014, 8506 : 98 - 109
  • [10] Topic-Based Hierarchical Segmentation
    Chien, Jen-Tzung
    Chueh, Chuang-Hua
    [J]. IEEE TRANSACTIONS ON AUDIO SPEECH AND LANGUAGE PROCESSING, 2012, 20 (01) : 55 - 66