Bias analysis in text classification for highly skewed data

Cited by: 0
Authors
Tang, L [1]
Liu, H [1]
Affiliation
[1] Arizona State Univ, Dept Comp Sci & Engn, Tempe, AZ 85287 USA
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Feature selection is often applied to high-dimensional data as a preprocessing step in text classification. When dealing with highly skewed data, we observe that typical feature selection metrics like information gain or chi-squared are biased toward selecting features for the minor class, and the metric of bi-normal separation can select features for both minor and major classes. In this work, we investigate how these feature selection metrics affect the performance of frequently used classifiers such as Decision Trees, Naive Bayes, and Support Vector Machines via bias analysis for highly skewed data. We consider three types of bias: metric bias, class bias, and classifier bias. Extensive experiments are designed to understand how these biases can be employed in concert and efficiently to achieve good classification performance. We report our findings and present recommended approaches to text classification based on bias analysis and the empirical study.
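The three metrics named in the abstract can each be computed from a term's 2x2 contingency table (documents with and without the term, in the minor and major class). The following is a minimal Python sketch based on the standard textbook definitions, not the authors' code. The function names, the argument layout (`tp`, `fp`, `pos`, `neg`), and the clipping constant `eps` are assumptions made for illustration.

```python
import math
from statistics import NormalDist


def _entropy(*counts):
    """Shannon entropy (in bits) of a discrete distribution given as counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)


def chi_squared(tp, fp, pos, neg):
    """Chi-squared statistic of a binary term feature.

    tp / fp: positive / negative documents that contain the term.
    pos / neg: total positive / negative documents.
    """
    fn, tn = pos - tp, neg - fp
    n = pos + neg
    denom = (tp + fp) * (fn + tn) * pos * neg
    return 0.0 if denom == 0 else n * (tp * tn - fp * fn) ** 2 / denom


def information_gain(tp, fp, pos, neg):
    """Class entropy minus class entropy conditioned on term presence."""
    fn, tn = pos - tp, neg - fp
    n = pos + neg
    present, absent = tp + fp, fn + tn
    conditional = 0.0
    if present:
        conditional += present / n * _entropy(tp, fp)
    if absent:
        conditional += absent / n * _entropy(fn, tn)
    return _entropy(pos, neg) - conditional


def bi_normal_separation(tp, fp, pos, neg, eps=0.0005):
    """|F^-1(tpr) - F^-1(fpr)|, with F^-1 the standard normal inverse CDF.

    Rates are clipped to [eps, 1 - eps] (an assumed choice) so the
    inverse CDF stays finite when a rate is exactly 0 or 1.
    """
    inv = NormalDist().inv_cdf
    tpr = min(max(tp / pos, eps), 1 - eps)
    fpr = min(max(fp / neg, eps), 1 - eps)
    return abs(inv(tpr) - inv(fpr))


if __name__ == "__main__":
    # Hypothetical skewed corpus: 50 minor-class and 5000 major-class documents.
    pos, neg = 50, 5000
    for name, tp, fp in [("term common in minor class", 40, 100),
                         ("term common in major class", 5, 4000)]:
        print(name,
              round(chi_squared(tp, fp, pos, neg), 2),
              round(information_gain(tp, fp, pos, neg), 4),
              round(bi_normal_separation(tp, fp, pos, neg), 3))
```

Running the script prints the three scores for two made-up terms, which gives a rough way to compare how each metric ranks terms associated with the minor class against terms associated with the major class. The counts are invented and carry no empirical weight.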
Pages: 781-784
Page count: 4