Bias analysis in text classification for highly skewed data

Cited by: 0
Authors
Tang, L [1]
Liu, H [1]
Affiliation
[1] Arizona State Univ, Dept Comp Sci & Engn, Tempe, AZ 85287 USA
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Feature selection is often applied to high-dimensional data as a preprocessing step in text classification. When dealing with highly skewed data, we observe that typical feature selection metrics such as information gain and chi-squared are biased toward selecting features for the minor class, whereas bi-normal separation can select features for both the minor and major classes. In this work, we investigate, via bias analysis for highly skewed data, how these feature selection metrics affect the performance of frequently used classifiers such as Decision Trees, Naive Bayes, and Support Vector Machines. We consider three types of bias: metric bias, class bias, and classifier bias. Extensive experiments are designed to understand how these biases can be employed in concert and efficiently to achieve good classification performance. We report our findings and present recommended approaches to text classification based on bias analysis and the empirical study.
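For readers unfamiliar with the three metrics named in the abstract, the following is a minimal sketch of how each can be computed from a term/class contingency table. It is not taken from the paper: the function names, the `(tp, fp, fn, tn)` argument convention, and the clipping value `eps=0.0005` for bi-normal separation are assumptions made for illustration, and the sketch assumes both classes contain at least one document.

```python
from math import log2
from statistics import NormalDist

# Contingency counts for one term (illustrative convention, not from the paper):
#   tp: minor-class docs containing the term   fp: major-class docs containing it
#   fn: minor-class docs without the term      tn: major-class docs without it


def chi_squared(tp, fp, fn, tn):
    """Chi-squared statistic of the 2x2 term/class table."""
    n = tp + fp + fn + tn
    denom = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    if denom == 0:  # a row or column is empty, so the term carries no signal
        return 0.0
    return n * (tp * tn - fp * fn) ** 2 / denom


def _entropy(*counts):
    """Shannon entropy in bits of a count vector, treating 0*log(0) as 0."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def information_gain(tp, fp, fn, tn):
    """Reduction in class entropy after observing term presence/absence."""
    n = tp + fp + fn + tn
    h_class = _entropy(tp + fn, fp + tn)
    h_given_term = ((tp + fp) / n) * _entropy(tp, fp) \
        + ((fn + tn) / n) * _entropy(fn, tn)
    return h_class - h_given_term


def bi_normal_separation(tp, fp, fn, tn, eps=0.0005):
    """|F^-1(tpr) - F^-1(fpr)|, with F^-1 the inverse standard normal CDF.

    Rates are clipped to [eps, 1 - eps] so the inverse CDF stays finite;
    the value of eps is an assumption of this sketch.
    """
    def clip(rate):
        return min(max(rate, eps), 1 - eps)

    tpr = clip(tp / (tp + fn))
    fpr = clip(fp / (fp + tn))
    inv = NormalDist().inv_cdf
    return abs(inv(tpr) - inv(fpr))


if __name__ == "__main__":
    # Hypothetical skewed corpus: 100 minor-class and 9900 major-class documents.
    terms = {
        "indicates minor class": (60, 100, 40, 9800),
        "indicates major class": (5, 5000, 95, 4900),
    }
    for name, table in terms.items():
        print(name,
              round(chi_squared(*table), 2),
              round(information_gain(*table), 4),
              round(bi_normal_separation(*table), 3))
```

Running the script prints the three scores for one hypothetical term that favors the minor class and one that favors the major class, which gives a quick way to compare how differently the metrics rank the two kinds of feature on a skewed table.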
Pages: 781-784
Page count: 4