Term weighting scheme for short-text classification: Twitter corpuses

Cited by: 40
Authors
Alsmadi, Issa [1 ]
Hoon, Gan Keng [1 ]
Affiliations
[1] Univ Sains Malaysia, Sch Comp Sci, Gelugor 11800, Pulau Pinang, Malaysia
Source
NEURAL COMPUTING & APPLICATIONS | 2019 / Vol. 31 / Issue 08
Keywords
Short text; Classification; Term weighting; Social networks; Twitter; Machine learning; FEATURE-SELECTION; CATEGORIZATION;
DOI
10.1007/s00521-017-3298-8
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Term weighting is a well-known preprocessing step in text classification that assigns an appropriate weight to each term in every document to improve classification performance. Most methods proposed in the literature use traditional approaches that emphasize term frequency. These methods perform reasonably well on traditional documents. However, they are unsuitable for social network data, where texts are limited in length and sparsity and noise are characteristic. This study introduces a simple supervised term weighting approach, SW, which accounts for the special nature of short texts based on term strength and term distribution, and assesses its effect in a high-dimensional vector space against the term weighting schemes that serve as baselines in traditional text classification. Two datasets are used with support vector machine, decision tree, k-nearest neighbor, and logistic regression algorithms. The first, the Sanders dataset, is a benchmark dataset of approximately 5000 tweets in four categories. The second, a self-collected dataset, contains roughly 1500 tweets in six classes collected using the Twitter API. The evaluation applies tenfold cross-validation on the labeled data to compare the proposed approach with state-of-the-art methods. The experimental results indicate that the performance of supervised approaches varies but is predominantly better than that of unsupervised approaches, and that the proposed SW approach achieves better accuracy than the other methods. SW can deal with the limitations of short texts and mitigate the limitations of traditional approaches in the literature, improving performance to 80.83 and 90.64 (F-measure) on the Sanders dataset and the self-collected dataset, respectively.
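As a rough illustration of the kind of frequency-based baseline the abstract refers to, the sketch below computes plain unsupervised TF-IDF weights in Python. It is not the SW scheme, whose formula is not given in this record. The function name `tfidf_weights`, the whitespace tokenizer, the `tf * log(N / df)` variant, and the toy tweets are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Return one {term: weight} dict per document using tf * log(N / df).

    Illustrative baseline only; this is NOT the SW scheme from the paper.
    """
    # Assumption: naive lowercase whitespace tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)  # raw term frequency within this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


# Hypothetical toy tweets, not taken from either dataset in the paper.
tweets = [
    "apple releases new phone",
    "google updates search",
    "apple phone battery issue",
    "new search features from google",
]
weights = tfidf_weights(tweets)
```

In these toy tweets every term occurs once per document, so each weight reduces to its inverse document frequency. This is one way to see why schemes that lean on term frequency carry little extra signal for short texts.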
Pages: 3819-3831
Page count: 13