Word-Level and Pinyin-Level Based Chinese Short Text Classification

Cited: 5
Authors
Sun, Xinjie [1 ,2 ]
Huo, Xingying [1 ]
Affiliations
[1] Liupanshui Normal Univ, Inst Comp Sci, Liupanshui 553004, Peoples R China
[2] Guizhou Xinjie Qianxun Software Serv Co Ltd, Liupanshui 553004, Peoples R China
Keywords
Short text classification; data sparsity; homophonic typos problem; word-level; Pinyin-level; text center; CONVOLUTIONAL NEURAL-NETWORK; CNN;
DOI
10.1109/ACCESS.2022.3225659
CLC Classification
TP [Automation technology, computer technology]
Discipline Code
0812
Abstract
Short text classification is an important branch of Natural Language Processing. Although CNN and RNN models have achieved satisfactory results on text classification tasks, they are difficult to apply to Chinese short text classification because of data sparsity and homophonic typos. To address these problems, a Chinese short text classification model based on word-level and Pinyin-level features is constructed. Since homophones share the same Pinyin, adding Pinyin-level features resolves the homophonic typos problem, and the richer feature set also alleviates the data sparsity of short text. To fully extract the deep hidden features of short text, a deep learning model based on BiLSTM, Attention and CNN is constructed, and a residual network is used to counter the vanishing-gradient problem that appears as the number of network layers increases. Furthermore, considering that a complex deep network structure increases classification time, a Text Center is constructed: when a new text arrives, it is classified quickly by computing the Manhattan distance between its embedding vector and the vectors stored in the Text Center. The Accuracy, Precision, Recall and F1 of the proposed model are 0.9713, 0.9627, 0.9765 and 0.9696 on the simplifyweibo_4_moods dataset and 0.9533, 0.9416, 0.9608 and 0.9511 on the online_shopping_10_cats dataset, respectively, outperforming the baseline methods. In addition, the classification time of the proposed model on simplifyweibo_4_moods and online_shopping_10_cats is 0.0042 and 0.0033 respectively, far lower than that of the baseline methods.
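As a concrete illustration of the Text Center mechanism described above, the following Python sketch (hypothetical names and toy 4-dimensional vectors; not the authors' implementation) assigns a new text's embedding to the class whose stored center vector is nearest in Manhattan distance.

```python
import numpy as np

def manhattan_distance(a: np.ndarray, b: np.ndarray) -> float:
    """L1 (Manhattan) distance between two embedding vectors."""
    return float(np.sum(np.abs(a - b)))

def classify_with_text_center(embedding: np.ndarray, text_center: dict) -> str:
    """Return the label whose stored center vector is closest in Manhattan distance.

    `text_center` maps each class label to a representative embedding,
    e.g. an average of the fused word-level and Pinyin-level vectors of
    that class's training texts (an assumption made for this illustration).
    """
    return min(text_center,
               key=lambda label: manhattan_distance(embedding, text_center[label]))

# Toy usage: in the paper the embeddings would come from the
# BiLSTM-Attention-CNN encoder rather than being hand-written.
centers = {
    "positive": np.array([0.9, 0.1, 0.8, 0.2]),
    "negative": np.array([0.1, 0.9, 0.2, 0.8]),
}
new_text_vec = np.array([0.85, 0.2, 0.7, 0.3])
print(classify_with_text_center(new_text_vec, centers))  # -> positive
```

Because this lookup replaces a full forward pass through the deep network at inference time, it is in line with the reduced classification time reported in the abstract; the actual construction of the stored center vectors follows the paper, not this sketch.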
Pages: 125552-125563
Page count: 12