Word-Level and Pinyin-Level Based Chinese Short Text Classification

Cited: 5
Authors
Sun, Xinjie [1 ,2 ]
Huo, Xingying [1 ]
Affiliations
[1] Liupanshui Normal Univ, Inst Comp Sci, Liupanshui 553004, Peoples R China
[2] Guizhou Xinjie Qianxun Software Serv Co Ltd, Liupanshui 553004, Peoples R China
Keywords
Short text classification; data sparsity; homophonic typos problem; word-level; Pinyin-level; text center; CONVOLUTIONAL NEURAL-NETWORK; CNN;
DOI
10.1109/ACCESS.2022.3225659
CLC Classification
TP [Automation technology, computer technology]
Discipline Code
0812
Abstract
Short text classification is an important branch of Natural Language Processing. Although CNN and RNN models have achieved satisfactory results on text classification tasks, they are difficult to apply to Chinese short text classification because of data sparsity and homophonic typos. To address these problems, a Chinese short text classification model based on word-level and Pinyin-level features is constructed. Since homophones share the same Pinyin, adding Pinyin-level features resolves the homophonic typos problem, and the richer feature set also alleviates the data sparsity of short text. To fully extract the deep hidden features of short text, a deep learning model based on BiLSTM, Attention and CNN is constructed, and a residual network is used to counter the vanishing-gradient problem that appears as the number of network layers increases. Furthermore, considering that a complex deep network structure increases classification time, a Text Center is constructed: when a new text arrives, it is classified quickly by computing the Manhattan distance between its embedding vector and the vectors stored in the Text Center. The Accuracy, Precision, Recall and F1 of the proposed model are 0.9713, 0.9627, 0.9765 and 0.9696 on the simplifyweibo_4_moods dataset and 0.9533, 0.9416, 0.9608 and 0.9511 on the online_shopping_10_cats dataset, respectively, outperforming the baseline methods. In addition, the classification time of the proposed model on simplifyweibo_4_moods and online_shopping_10_cats is 0.0042 and 0.0033 respectively, far lower than that of the baseline methods.
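As a concrete illustration of the Text Center mechanism described above, the following Python sketch (hypothetical names and toy 4-dimensional vectors; not the authors' implementation) assigns a new text's embedding to the class whose stored center vector is nearest in Manhattan distance.

```python
import numpy as np

def manhattan_distance(a: np.ndarray, b: np.ndarray) -> float:
    """L1 (Manhattan) distance between two embedding vectors."""
    return float(np.sum(np.abs(a - b)))

def classify_with_text_center(embedding: np.ndarray, text_center: dict) -> str:
    """Return the label whose stored center vector is closest in Manhattan distance.

    `text_center` maps each class label to a representative embedding,
    e.g. an average of the fused word-level and Pinyin-level vectors of
    that class's training texts (an assumption made for this illustration).
    """
    return min(text_center,
               key=lambda label: manhattan_distance(embedding, text_center[label]))

# Toy usage: in the paper the embeddings would come from the
# BiLSTM-Attention-CNN encoder rather than being hand-written.
centers = {
    "positive": np.array([0.9, 0.1, 0.8, 0.2]),
    "negative": np.array([0.1, 0.9, 0.2, 0.8]),
}
new_text_vec = np.array([0.85, 0.2, 0.7, 0.3])
print(classify_with_text_center(new_text_vec, centers))  # -> positive
```

Because this lookup replaces a full forward pass through the deep network at inference time, it is in line with the reduced classification time reported in the abstract; the actual construction of the stored center vectors follows the paper, not this sketch.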
Pages: 125552-125563
Page count: 12