Topic-based Classification through Unigram Unmasking

Cited by: 3
Authors
HaCohen-Kerner, Yaakov [1]
Rosenfeld, Avi [2]
Sabag, Asaf [1]
Tzidkani, Maor [1]
Affiliations
[1] Jerusalem Coll Technol, Dept Comp Sci, IL-9116001 Jerusalem, Israel
[2] Jerusalem Coll Technol, Dept Ind Engn, IL-9116001 Jerusalem, Israel
Keywords
Bag of words; Overfitting Features; Supervised machine learning; Textual features; Text classification; Topic-based classification; Unmasking; Word unigrams; STYLISTIC FEATURE SETS; HISTORICAL PERIOD; CLASSIFIERS; DOCUMENTS;
DOI
10.1016/j.procs.2018.07.210
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract
Text classification (TC) is the task of automatically assigning documents to a fixed number of categories. TC is an important component in many text applications such as text indexing, information extraction, information retrieval, text mining, and word sense disambiguation. In this paper, we present an alternative method of feature reduction, a concept we call unigram unmasking. Previous text classification approaches have typically focused on a "bag-of-words" vector. We posit that some of the most frequent unigrams, which have the greatest weight within these vectors, are at times not only unnecessary for classification but can even hurt models' accuracy. We present an approach where a percentage of common unigrams are intentionally removed, thus "unmasking" the added value from less popular unigrams. We present results from a topic-based classification task (hundreds of free online textbooks belonging to five domains: Career and Study Advice, Economics and Finance, IT Programming, Natural Sciences, Statistics and Mathematics) and show that unmasking was helpful across several machine learning models, with some models even benefiting from removing nearly 50% of the most frequent unigrams from the bag-of-words vectors. © 2018 The Authors. Published by Elsevier Ltd.
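The removal step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how unigrams are tokenized, whether frequency means total count or document frequency, or which classifiers were used. The function names (`tokenize`, `unmask_vocabulary`, `bag_of_words`), the whitespace tokenizer, the use of total corpus counts, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (assumed; a stand-in for real preprocessing)."""
    return text.lower().split()


def unmask_vocabulary(documents, drop_fraction):
    """Return the vocabulary left after dropping the most frequent unigrams.

    drop_fraction is the share of distinct unigrams to remove, taken from
    the top of the frequency ranking (assumed: total corpus counts).
    """
    if not 0.0 <= drop_fraction < 1.0:
        raise ValueError("drop_fraction must be in [0, 1)")
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    # Rank by descending frequency; break ties alphabetically for determinism.
    ranked = sorted(counts, key=lambda word: (-counts[word], word))
    n_drop = int(len(ranked) * drop_fraction)
    return sorted(ranked[n_drop:])


def bag_of_words(document, vocabulary):
    """Count vector of one document over the reduced vocabulary."""
    counts = Counter(tokenize(document))
    return [counts[word] for word in vocabulary]


# Hypothetical toy corpus: "the" and "and" dominate the counts.
docs = [
    "the market and the bank",
    "the loop and the array",
    "the proof and the lemma",
]
vocab = unmask_vocabulary(docs, drop_fraction=0.25)
vectors = [bag_of_words(doc, vocab) for doc in docs]
```

With 8 distinct unigrams and `drop_fraction=0.25`, the two most frequent ones ("the" and "and") are dropped, so each vector covers only the six topic-bearing words. The resulting vectors could then be passed to any supervised classifier; the drop fraction would be tuned per model, since the abstract reports that the helpful amount of removal varied across models.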
Pages: 69-76
Number of pages: 8