Word-Level and Pinyin-Level Based Chinese Short Text Classification

Cited by: 5
Authors
Sun, Xinjie [1, 2]
Huo, Xingying [1]
Affiliations
[1] Liupanshui Normal Univ, Inst Comp Sci, Liupanshui 553004, Peoples R China
[2] Guizhou Xinjie Qianxun Software Serv Co Ltd, Liupanshui 553004, Peoples R China
Keywords
Short text classification; data sparsity; homophonic typos problem; word-level; Pinyin-level; text center; convolutional neural network; CNN
DOI
10.1109/ACCESS.2022.3225659
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Short text classification is an important branch of natural language processing. Although CNNs and RNNs have achieved satisfactory results on text classification tasks, they are difficult to apply to Chinese short text classification because short texts suffer from data sparsity and homophonic typos. To solve these problems, a Chinese short text classification model based on word-level and Pinyin-level features is constructed. Since homophones have the same Pinyin, adding Pinyin-level features can solve the homophonic typos problem. In addition, because more features are introduced, the data sparsity problem of short text can be solved. To fully extract the deep hidden features of short text, a deep learning model based on BiLSTM, Attention and CNN is constructed, and a residual network is used to counter the vanishing gradient problem that arises as the number of network layers increases. Because a complex deep network structure increases classification time, a Text Center is also constructed: when a new text is input, it can be classified quickly by calculating the Manhattan distance between its embedding vector and the vectors stored in the Text Center. On the simplifyweibo_4_moods dataset, the Accuracy, Precision, Recall and F1 of the proposed model are 0.9713, 0.9627, 0.9765 and 0.9696, respectively; on the online_shopping_10_cats dataset, they are 0.9533, 0.9416, 0.9608 and 0.9511, respectively. Both sets of results are better than those of the baseline method. In addition, the classification time of the proposed model on simplifyweibo_4_moods and online_shopping_10_cats is 0.0042 and 0.0033, respectively, which is far lower than that of the baseline method.
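The Text Center lookup mentioned in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how the stored vectors are produced, so the sketch assumes one mean embedding per class; the function names, the class labels and the two-dimensional toy embeddings are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def manhattan(a: Vector, b: Vector) -> float:
    """Manhattan (L1) distance: the sum of absolute coordinate differences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum(abs(x - y) for x, y in zip(a, b))


def build_text_center(embeddings: List[Vector], labels: List[str]) -> Dict[str, List[float]]:
    """Store one vector per class.

    Assumption: each stored vector is the mean of that class's embeddings.
    The abstract only says that vectors are stored in the Text Center.
    """
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for vec, label in zip(embeddings, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def classify(vec: Vector, text_center: Dict[str, List[float]]) -> str:
    """Return the label whose stored vector is nearest in Manhattan distance."""
    return min(text_center, key=lambda label: manhattan(vec, text_center[label]))


# Toy embeddings standing in for the deep model's output (hypothetical values).
train_vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.7]]
train_labels = ["class_a", "class_a", "class_b", "class_b"]

centers = build_text_center(train_vectors, train_labels)
print(classify([0.7, 0.3], centers))  # prints class_a
```

Under this assumption, classifying a new text costs one distance computation per stored vector once its embedding is available, which is much cheaper than a full classification head over a deep network.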
Pages: 125552-125563
Page count: 12