Stochastic Tokenization with a Language Model for Neural Text Classification

Authors
Hiraoka, Tatsuya [1]
Shindo, Hiroyuki [2]
Matsumoto, Yuji [2, 3]
Affiliations
[1] Tokyo Inst Technol, Tokyo, Japan
[2] Nara Inst Sci & Technol, Ikoma, Nara, Japan
[3] RIKEN Ctr Adv Intelligence Project AIP, Wako, Saitama, Japan
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
For unsegmented languages such as Japanese and Chinese, tokenization of a sentence has a significant impact on the performance of text classification. Sentences are usually segmented into words or subwords by a morphological analyzer or byte pair encoding and then encoded with word (or subword) representations for neural networks. However, segmentation is potentially ambiguous, and it is unclear whether the segmented tokens achieve the best performance for the target task. In this paper, we propose a method to simultaneously learn tokenization and text classification to address these problems. Our model incorporates a language model for unsupervised tokenization into a text classifier and then trains both models simultaneously. To make the model robust against infrequent tokens, we sample a segmentation for each sentence stochastically during training, which improves text classification performance. We conduct experiments on sentiment analysis as a text classification task and show that our method achieves better performance than previous methods.
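The stochastic sampling step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a plain unigram language model over a fixed toy vocabulary and draws a segmentation by forward filtering and backward sampling, so that each segmentation is chosen in proportion to the product of its token probabilities. The function name `sample_segmentation`, the `max_len` limit, and the values in `TOY_PROBS` are hypothetical and chosen only for illustration.

```python
import random

# Hypothetical unigram probabilities for a toy vocabulary (assumed values).
TOY_PROBS = {
    "東": 0.1, "京": 0.1, "都": 0.1, "東京": 0.2, "京都": 0.2, "東京都": 0.1,
    "に": 0.1, "住": 0.05, "む": 0.05, "住む": 0.1,
}


def sample_segmentation(text, unigram_probs, max_len=4, rng=None):
    """Sample one segmentation of `text`, weighted by the product of token probabilities."""
    rng = rng or random.Random()
    n = len(text)

    # Forward pass: alpha[i] is the summed probability of all segmentations of text[:i].
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            alpha[end] += alpha[start] * unigram_probs.get(text[start:end], 0.0)
    if alpha[n] == 0.0:
        raise ValueError("no segmentation has nonzero probability")

    # Backward pass: pick the last token first, then recurse on the remaining prefix.
    tokens = []
    end = n
    while end > 0:
        starts = list(range(max(0, end - max_len), end))
        weights = [alpha[s] * unigram_probs.get(text[s:end], 0.0) for s in starts]
        start = rng.choices(starts, weights=weights, k=1)[0]
        tokens.append(text[start:end])
        end = start
    tokens.reverse()
    return tokens


if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        print(sample_segmentation("東京都に住む", TOY_PROBS, rng=rng))
```

Calling the function repeatedly on the same sentence yields different segmentations, so a classifier trained on these samples would see several tokenizations of each sentence rather than a single fixed one. In the setting the abstract describes, the language model is trained jointly with the classifier; this sketch keeps the probabilities fixed.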
Pages: 1620-1629
Number of pages: 10