A CASE STUDY IN TEXT MINING OF DISCUSSION FORUM POSTS: CLASSIFICATION WITH BAG OF WORDS AND GLOBAL VECTORS

Cited by: 13
Author
Cichosz, Pawel [1]
Affiliation
[1] Warsaw Univ Technol, Inst Comp Sci, Nowowiejska 15-19, PL-00665 Warsaw, Poland
Keywords
text mining; discussion forums; text representation; document classification; word embedding; ONLINE;
DOI
10.2478/amcs-2018-0060
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Despite the rapid growth of other types of social media, Internet discussion forums remain a highly popular communication channel and a useful source of text data for analyzing user interests and sentiments. Being suited to richer, deeper, and longer discussions than microblogging services, they are particularly good at reflecting topics of long-term, persistent involvement and areas of specialized knowledge or experience. Discovering and characterizing such topics and areas with text mining algorithms is therefore an interesting and useful research direction. This work presents a case study in which selected classification algorithms are applied to posts from a Polish discussion forum devoted to psychoactive substances obtained from home-grown plants, such as hashish or marijuana. The utility of two different vector text representations is examined: the simple bag of words representation and the more refined global vectors (GloVe) word embedding representation. While the former is found to work well for the multinomial naive Bayes algorithm, the latter turns out to be more useful for the other classification algorithms: logistic regression, SVMs, and random forests. The results suggest that post classification can be applied to measure the publication intensity of particular topics and, in the case of forums related to psychoactive substances, to monitor the risk of drug-related crime.
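To make the first of the two representations concrete, the following is a minimal sketch of bag of words features fed to a multinomial naive Bayes classifier with add-one smoothing. It is an illustration only, not the author's implementation: this record does not describe the paper's preprocessing, feature selection, or class labels, so the whitespace tokenizer, the toy posts, and the labels `cultivation` and `legal` are all invented assumptions.

```python
import math
from collections import Counter, defaultdict


def bag_of_words(text):
    """Lowercase, split on whitespace, and count term occurrences."""
    return Counter(text.lower().split())


class MultinomialNaiveBayes:
    """Multinomial naive Bayes with Laplace (add-one) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)       # documents per class
        self.term_counts = defaultdict(Counter)   # class -> term -> count
        self.vocabulary = set()
        for text, label in zip(texts, labels):
            bow = bag_of_words(text)
            self.term_counts[label].update(bow)
            self.vocabulary.update(bow)
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior of each class."""
        total_docs = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        bow = bag_of_words(text)
        scores = {}
        for label, n_docs in self.class_counts.items():
            class_total = sum(self.term_counts[label].values())
            score = math.log(n_docs / total_docs)  # log prior
            for term, count in bow.items():
                if term not in self.vocabulary:
                    continue  # ignore terms never seen in training
                p = (self.term_counts[label][term] + 1) / (class_total + vocab_size)
                score += count * math.log(p)
            scores[label] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy, invented posts and labels (not data from the study).
train_texts = [
    "seeds germination soil watering",
    "soil nutrients watering light",
    "police law penalty court",
    "court penalty law fine",
]
train_labels = ["cultivation", "cultivation", "legal", "legal"]

model = MultinomialNaiveBayes().fit(train_texts, train_labels)
print(model.predict("watering and soil for new seeds"))
print(model.predict("penalty court"))
```

The embedding-based alternative mentioned in the abstract would replace `bag_of_words` with dense word vectors combined per post; how the paper combines them is not stated in this record, so it is not sketched here.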
Pages: 787-801
Page count: 15
Related papers
50 in total
  • [1] Anomaly detection in discussion forum posts using Global Vectors
    Cichosz, Pawel
    PHOTONICS APPLICATIONS IN ASTRONOMY, COMMUNICATIONS, INDUSTRY, AND HIGH-ENERGY PHYSICS EXPERIMENTS 2018, 2018, 10808
  • [2] Automatic Classification of Forum Posts: A Finnish Online Health Discussion Forum Case
    Gencoglu, O.
    EMBEC & NBC 2017, 2018, 65 : 169 - 172
  • [4] Albanian Text Classification: Bag of Words Model and Word Analogies
    Kadriu, Arbana
    Abazi, Lejla
    Abazi, Hyrije
    BUSINESS SYSTEMS RESEARCH JOURNAL, 2019, 10 (01): : 74 - 87
  • [5] Text Mining in Hotel Reviews: Impact of Words Restriction in Text Classification
    Campos, Diogo
    Silva, Rodrigo Rocha
    Bernardino, Jorge
    KDIR: PROCEEDINGS OF THE 11TH INTERNATIONAL JOINT CONFERENCE ON KNOWLEDGE DISCOVERY, KNOWLEDGE ENGINEERING AND KNOWLEDGE MANAGEMENT - VOL 1: KDIR, 2019, : 442 - 449
  • [7] The influence of preprocessing on text classification using a bag-of-words representation
    HaCohen-Kerner, Yaakov
    Miller, Daniel
    Yigal, Yair
    PLOS ONE, 2020, 15 (05):
  • [8] Network-Based Bag-of-Words Model for Text Classification
    Yan, Dongyang
    Li, Keping
    Gu, Shuang
    Yang, Liu
    IEEE ACCESS, 2020, 8 : 82641 - 82652
  • [9] A New Text Representation Scheme Combining Bag-of-Words and Bag-of-Concepts Approaches for Automatic Text Classification
    Alahmadi, Alaa
    Joorabchi, Arash
    Mahdi, Abdulhussain E.
    2013 7TH IEEE GCC CONFERENCE AND EXHIBITION (GCC), 2013, : 108 - 113
  • [10] A Personality Mining System for German Twitter Posts With Global Vectors Word Embedding
    Usselmann, Henning
    Ahmad, Rangina
    Siemon, Dominik
    IEEE ACCESS, 2021, 9 : 165576 - 165610