Are n-gram Categories Helpful in Text Classification?

Cited by: 10
Authors
Kruczek, Jakub [1]
Kruczek, Paulina [1]
Kuta, Marcin [1]
Affiliations
[1] AGH Univ Sci & Technol, Fac Comp Sci Elect & Telecommun, Dept Comp Sci, Al Mickiewicza 30, PL-30059 Krakow, Poland
Source
Keywords
Character n-grams; Typed n-grams; Authorship attribution; Author profiling; Sentiment analysis;
DOI
10.1007/978-3-030-50417-5_39
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Character n-grams are widely used in text categorization problems and are the single most successful type of feature in authorship attribution. Their primary advantage is language independence, as they can be applied to a new language with no additional effort. Typed character n-grams reflect information about their content and context. According to previous research, typed character n-grams improve the accuracy of authorship attribution. This paper examines their effectiveness in three domains: authorship attribution, author profiling and sentiment analysis. The problem of a very high number of features is tackled with distributed Apache Spark processing.
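As a rough illustration of the features discussed in the abstract, the following minimal Python sketch extracts overlapping character n-grams and attaches a coarse type label to each one. The function names and the three labels ("punct", "space", "word") are hypothetical choices made for this example only; they are not the category scheme evaluated in the paper, and the sketch does not cover the Apache Spark processing.

```python
import string
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of length n, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def coarse_type(ngram):
    """Assign a coarse, illustrative label to an n-gram.

    These labels are an assumption for demonstration purposes,
    not the typed n-gram categories used by the paper.
    """
    if any(ch in string.punctuation for ch in ngram):
        return "punct"
    if any(ch.isspace() for ch in ngram):
        return "space"
    return "word"


def typed_ngram_counts(text, n=3):
    """Count (type, n-gram) pairs so the same n-gram can be
    distinguished by the label it received."""
    return Counter((coarse_type(g), g) for g in char_ngrams(text, n))


features = typed_ngram_counts("the cat sat", n=3)
```

The resulting counter can be used as a sparse feature vector for a classifier. Because the features are built from raw characters, no tokenizer or language-specific resource is needed, which is the language independence the abstract refers to.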
Pages: 524-537
Page count: 14
Related papers
  • [1] A Neural N-Gram Network for Text Classification
    Yan, Zhenguo
    Wu, Yue
    [J]. JOURNAL OF ADVANCED COMPUTATIONAL INTELLIGENCE AND INTELLIGENT INFORMATICS, 2018, 22 (03) : 380 - 386
  • [2] Japanese text classification using N-gram and the maximum ratio of term frequency among categories
    Suzuki, Makoto
    [J]. PROCEEDINGS OF THE 11TH IASTED INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND SOFT COMPUTING, 2007, : 197 - 202
  • [3] n-BiLSTM: BiLSTM with n-gram Features for Text Classification
    Zhang, Yunxiang
    Rao, Zhuyi
    [J]. PROCEEDINGS OF 2020 IEEE 5TH INFORMATION TECHNOLOGY AND MECHATRONICS ENGINEERING CONFERENCE (ITOEC 2020), 2020, : 1056 - 1059
  • [4] Automatic Chinese Text Classification Using N-Gram Model
    Yen, Show-Jane
    Lee, Yue-Shi
    Wu, Yu-Chieh
    Ying, Jia-Ching
    Tseng, Vincent S.
    [J]. COMPUTATIONAL SCIENCE AND ITS APPLICATIONS - ICCSA 2010, PT 3, PROCEEDINGS, 2010, 6018 : 458 - +
  • [5] A Short Text Classification Method Based on N-Gram and CNN
    WANG Haitao
    HE Jie
    ZHANG Xiaohong
    LIU Shufen
    [J]. Chinese Journal of Electronics, 2020, 29 (02) : 248 - 254
  • [6] A Short Text Classification Method Based on N-Gram and CNN
    Wang, Haitao
    He, Jie
    Zhang, Xiaohong
    Liu, Shufen
    [J]. CHINESE JOURNAL OF ELECTRONICS, 2020, 29 (02) : 248 - 254
  • [7] An ensemble text classification model combining strong rules and N-Gram
    Liu, Jinhong
    Lu, Yuliang
    [J]. ICNC 2007: THIRD INTERNATIONAL CONFERENCE ON NATURAL COMPUTATION, VOL 3, PROCEEDINGS, 2007, : 535 - +
  • [8] Combining naive Bayes and n-gram language models for text classification
    Peng, FC
    Schuurmans, D
    [J]. ADVANCES IN INFORMATION RETRIEVAL, 2003, 2633 : 335 - 350
  • [9] Text mining with n-gram variables
    Schonlau, Matthias
    Guenther, Nick
    Sucholutsky, Ilia
    [J]. STATA JOURNAL, 2017, 17 (04) : 866 - 881
  • [10] SEARCHING FOR TEXT - SEND AN N-GRAM
    KIMBRELL, RE
    [J]. BYTE, 1988, 13 (05) : 297 - &