Bias analysis in text classification for highly skewed data

Cited by: 0
Authors
Tang, L [1]
Liu, H [1]
Affiliations
[1] Arizona State Univ, Dept Comp Sci & Engn, Tempe, AZ 85287 USA
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Feature selection is often applied to high-dimensional data as a preprocessing step in text classification. When dealing with highly skewed data, we observe that typical feature selection metrics such as information gain and chi-squared are biased toward selecting features for the minor class, whereas bi-normal separation can select features for both the minor and major classes. In this work, we use bias analysis to investigate how these feature selection metrics affect the performance of frequently used classifiers, such as Decision Trees, Naive Bayes, and Support Vector Machines, on highly skewed data. We consider three types of bias: metric bias, class bias, and classifier bias. Extensive experiments are designed to understand how these biases can be employed in concert, and efficiently, to achieve good classification performance. We report our findings and present recommended approaches to text classification based on the bias analysis and the empirical study.
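The three metrics named in the abstract can be computed from a 2x2 count table for a single term. The following is a minimal sketch using the standard textbook definitions, not the authors' implementation. The function names, the count names (`tp`, `fp`, `fn`, `tn`, with the minor class treated as positive), the clipping value `eps`, and the example counts are all assumptions introduced here for illustration. Both classes are assumed to be non-empty.

```python
import math
from statistics import NormalDist

# Hypothetical count names for one term:
#   tp: minor-class documents containing the term
#   fn: minor-class documents without the term
#   fp: major-class documents containing the term
#   tn: major-class documents without the term


def _entropy(a, b):
    """Binary entropy (in bits) of a two-way split given as counts."""
    total = a + b
    h = 0.0
    for count in (a, b):
        if count > 0:
            p = count / total
            h -= p * math.log2(p)
    return h


def chi_squared(tp, fp, fn, tn):
    """Chi-squared statistic of the 2x2 term/class contingency table."""
    n = tp + fp + fn + tn
    denom = (tp + fp) * (fn + tn) * (tp + fn) * (fp + tn)
    return n * (tp * tn - fp * fn) ** 2 / denom if denom else 0.0


def information_gain(tp, fp, fn, tn):
    """Class entropy minus class entropy conditioned on term presence."""
    n = tp + fp + fn + tn
    present, absent = tp + fp, fn + tn
    conditional = 0.0
    if present:
        conditional += present / n * _entropy(tp, fp)
    if absent:
        conditional += absent / n * _entropy(fn, tn)
    return _entropy(tp + fn, fp + tn) - conditional


def bi_normal_separation(tp, fp, fn, tn, eps=0.0005):
    """|F^-1(tpr) - F^-1(fpr)|, F^-1 being the inverse standard normal CDF.

    Rates are clipped to [eps, 1 - eps] so the inverse CDF stays finite;
    the value of eps is an assumption of this sketch.
    """
    tpr = min(max(tp / (tp + fn), eps), 1 - eps)
    fpr = min(max(fp / (fp + tn), eps), 1 - eps)
    inv = NormalDist().inv_cdf
    return abs(inv(tpr) - inv(fpr))


# Illustrative skewed corpus: 100 minor-class and 9900 major-class documents.
# "positive" term: frequent in the minor class, rare in the major class.
# "negative" term: rare in the minor class, common in the major class.
examples = {
    "positive": (60, 99, 40, 9801),
    "negative": (5, 4950, 95, 4950),
}
for name, counts in examples.items():
    print(name,
          chi_squared(*counts),
          information_gain(*counts),
          bi_normal_separation(*counts))
```

Comparing the scores that each metric assigns to the two example terms gives a rough feel for the metric bias the abstract describes. The counts are invented, so the output says nothing about the paper's experimental results.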
Pages: 781 - 784
Page count: 4
Related papers
50 in total
  • [1] An analytical study of the classification of highly skewed data
    Siddiqui, Fatima
    Ali, Qazi M.
    COMMUNICATIONS IN STATISTICS-SIMULATION AND COMPUTATION, 2017, 46 (10) : 7582 - 7601
  • [2] Comparison of Feature Selection Methods in Text Classification on Highly Skewed Datasets
    Asim, Muhammad Nabeel
    Wasim, Muhammad
    Ali, Muhammad Sajid
    Rehman, Abdur
    2017 FIRST INTERNATIONAL CONFERENCE ON LATEST TRENDS IN ELECTRICAL ENGINEERING AND COMPUTING TECHNOLOGIES (INTELLECT), 2017,
  • [3] A Novel Inherent Distinguishing Feature Selector for Highly Skewed Text Document Classification
    Ali, Muhammad Sajid
    Javed, Kashif
    ARABIAN JOURNAL FOR SCIENCE AND ENGINEERING, 2020, 45 (12) : 10471 - 10491
  • [4] A Novel Inherent Distinguishing Feature Selector for Highly Skewed Text Document Classification
    Muhammad Sajid Ali
    Kashif Javed
    Arabian Journal for Science and Engineering, 2020, 45 : 10471 - 10491
  • [5] Statistical analysis of highly skewed immune response data
    McGuinness, D
    Bennett, S
    Riley, E
    JOURNAL OF IMMUNOLOGICAL METHODS, 1997, 201 (01) : 99 - 114
  • [6] Accurate SVM text classification for highly skewed data using threshold tuning and query-expansion-based feature selection
    Goertzel, Ben
    Venuto, James
    2006 IEEE INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORK PROCEEDINGS, VOLS 1-10, 2006, : 1220 - +
  • [7] Robust classification for skewed data
    Mia Hubert
    Stephan Van der Veeken
    Advances in Data Analysis and Classification, 2010, 4 : 239 - 254
  • [8] Robust classification for skewed data
    Hubert, Mia
    Van der Veeken, Stephan
    ADVANCES IN DATA ANALYSIS AND CLASSIFICATION, 2010, 4 (04) : 239 - 254
  • [9] A Novel Term-weighting Approach in Text Classification over Skewed Data Sets
    Sun, Tieli
    Zhang, Yujie
    Yang, Fengqin
    Yang, Xiquan
    Jiang, Yingjie
    Wang, Zibing
    Li, Kuiwu
    INFORMATION-AN INTERNATIONAL INTERDISCIPLINARY JOURNAL, 2010, 13 (03) : 621 - 633
  • [10] Random Forest Based Multiclass Classification Approach for Highly Skewed Particle Data
    Kuzu, Serpil Yalcin
    JOURNAL OF SCIENTIFIC COMPUTING, 2023, 95 (01)