Stochastic Tokenization with a Language Model for Neural Text Classification

Authors
Hiraoka, Tatsuya [1]
Shindo, Hiroyuki [2]
Matsumoto, Yuji [2, 3]
Affiliations
[1] Tokyo Inst Technol, Tokyo, Japan
[2] Nara Inst Sci & Technol, Ikoma, Nara, Japan
[3] RIKEN Ctr Adv Intelligence Project AIP, Wako, Saitama, Japan
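Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract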
For unsegmented languages such as Japanese and Chinese, the tokenization of a sentence has a significant impact on the performance of text classification. Sentences are usually segmented into words or subwords by a morphological analyzer or byte pair encoding and then encoded with word (or subword) representations for neural networks. However, segmentation is potentially ambiguous, and it is unclear whether the segmented tokens achieve the best performance for the target task. In this paper, we propose a method that learns tokenization and text classification simultaneously to address these problems. Our model incorporates a language model for unsupervised tokenization into a text classifier and trains both models jointly. To make the model robust against infrequent tokens, we sample a segmentation for each sentence stochastically during training, which improves text classification performance. We conduct experiments on sentiment analysis as a text classification task and show that our method achieves better performance than previous methods.
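
The stochastic sampling step mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a unigram language model given as a plain dictionary of token probabilities (the `token_probs` table, the `max_len` limit, and the function name `sample_segmentation` are all hypothetical), and it draws one segmentation with probability proportional to the product of its token probabilities, using forward filtering followed by backward sampling.

```python
import random


def sample_segmentation(text, token_probs, max_len=4, rng=None):
    """Sample one segmentation of `text` under an assumed unigram LM.

    `token_probs` maps candidate tokens to probabilities (hypothetical values).
    A segmentation is drawn with probability proportional to the product of
    the probabilities of its tokens.
    """
    rng = rng or random.Random()
    n = len(text)

    # Forward pass: alpha[i] is the total probability mass of all
    # segmentations of the prefix text[:i].
    alpha = [0.0] * (n + 1)
    alpha[0] = 1.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            alpha[i] += alpha[j] * token_probs.get(text[j:i], 0.0)

    if alpha[n] == 0.0:
        raise ValueError("no segmentation has nonzero probability")

    # Backward pass: starting at the end, sample where the last token begins,
    # weighting each start position by the mass of the prefix before it.
    tokens = []
    i = n
    while i > 0:
        starts = list(range(max(0, i - max_len), i))
        weights = [alpha[j] * token_probs.get(text[j:i], 0.0) for j in starts]
        j = rng.choices(starts, weights=weights, k=1)[0]
        tokens.append(text[j:i])
        i = j

    tokens.reverse()
    return tokens
```

Calling the function repeatedly on the same sentence yields different segmentations across training steps, which is the kind of variation the abstract credits with robustness to infrequent tokens. In the paper's setting the language model is trained together with the classifier; here the probabilities are fixed purely for illustration.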
Pages: 1620-1629
Page count: 10