Term weighting scheme for short-text classification: Twitter corpuses

Cited by: 40
Authors
Alsmadi, Issa [1]
Hoon, Gan Keng [1]
Affiliations
[1] Univ Sains Malaysia, Sch Comp Sci, Gelugor 11800, Pulau Pinang, Malaysia
Source
NEURAL COMPUTING & APPLICATIONS | 2019 / Vol. 31 / Issue 08
Keywords
Short text; Classification; Term weighting; Social networks; Twitter; Machine learning; FEATURE-SELECTION; CATEGORIZATION;
DOI
10.1007/s00521-017-3298-8
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Term weighting is a well-known preprocessing step in text classification that assigns an appropriate weight to each term in every document to improve classification performance. Most methods proposed in the literature use traditional approaches that emphasize term frequency. These methods perform reasonably well on traditional documents but are unsuitable for social network data, where texts are short and characterized by sparsity and noise. This study introduces a simple supervised term weighting approach, SW, which accounts for the special nature of short texts based on term strength and term distribution, and assesses its effect in a high-dimensional vector space against the baseline term weighting schemes of traditional text classification. Two datasets are used with support vector machine, decision tree, k-nearest neighbor, and logistic regression algorithms. The first, the Sanders dataset, is a benchmark that includes approximately 5000 tweets in four categories. The second, a self-collected dataset, contains roughly 1500 tweets in six classes gathered using the Twitter API. The evaluation applies tenfold cross-validation on the labeled data to compare the proposed approach with state-of-the-art methods. The experimental results indicate that the supervised approaches vary in performance but are predominantly better than the unsupervised approaches, and that the proposed SW approach achieves higher accuracy than the others. SW can deal with the limitations of short texts and mitigate the limitations of traditional approaches in the literature, improving performance to 80.83 and 90.64 (F-measure) on the Sanders dataset and the self-collected dataset, respectively.
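To make the notion of a frequency-based baseline concrete, the following is a minimal sketch of a plain unsupervised tf-idf weighting in Python. It is not the SW scheme, whose formula is not given in this record. The function name `tf_idf`, the whitespace tokenizer, the `log(N / df)` variant, and the toy tweets are all assumptions made for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document using raw tf * log(N / df)."""
    # Assumption: lowercase + whitespace split stands in for real tweet tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term at least once.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy tweets, not taken from the Sanders or self-collected datasets.
tweets = [
    "new phone battery is great",
    "phone screen cracked again",
    "great match tonight",
]
weights = tf_idf(tweets)
```

In texts this short nearly every term occurs once per document, so the term-frequency factor is almost constant and the weight is driven by document frequency alone. This is one way to read the abstract's remark that frequency-oriented schemes suit short texts poorly.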
Pages: 3819 - 3831
Number of pages: 13