Word-Level and Pinyin-Level Based Chinese Short Text Classification

Cited by: 5

Authors:

Sun, Xinjie [1,2]

Huo, Xingying [1]

Affiliations:

[1] Liupanshui Normal Univ, Inst Comp Sci, Liupanshui 553004, Peoples R China

[2] Guizhou Xinjie Qianxun Software Serv Co Ltd, Liupanshui 553004, Peoples R China

Source:

IEEE ACCESS | 2022 / Vol. 10

Keywords:

Short text classification; data sparsity; homophonic typos problem; word-level; Pinyin-level; text center; CONVOLUTIONAL NEURAL-NETWORK; CNN

DOI:

10.1109/ACCESS.2022.3225659

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Short text classification is an important branch of Natural Language Processing. Although CNNs and RNNs have achieved satisfactory results in text classification tasks, they are difficult to apply to Chinese short text classification because such texts suffer from data sparsity and homophonic typos. To address these problems, a Chinese short text classification model based on word-level and Pinyin-level features is constructed. Since homophones share the same Pinyin, adding Pinyin-level features can solve the homophonic typos problem. In addition, the extra features help solve the data sparsity problem of short text. To fully extract the deep hidden features of the short text, a deep learning model based on BiLSTM, Attention and CNN is constructed, and a residual network is used to counter the vanishing gradient problem that arises as the number of network layers grows. Because a complex deep learning network structure increases classification time, a Text Center is also constructed. When a new text is input, it can be classified quickly by calculating the Manhattan distance between its embedding vector and the vectors stored in the Text Center. The Accuracy, Precision, Recall and F1 of the proposed model on the simplifyweibo_4_moods dataset are 0.9713, 0.9627, 0.9765 and 0.9696 respectively, and those on the online_shopping_10_cats dataset are 0.9533, 0.9416, 0.9608 and 0.9511 respectively, all better than those of the baseline method. In addition, the classification time of the proposed model on simplifyweibo_4_moods and online_shopping_10_cats is 0.0042 and 0.0033 respectively, far lower than that of the baseline method.
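The Text Center lookup described in the abstract can be illustrated with a minimal sketch. The abstract does not say how the stored vectors are built, so this example assumes one mean embedding per class; the function names (build_text_center, manhattan, classify), the labels, and the toy 2-dimensional vectors are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def build_text_center(embeddings: List[Vector], labels: List[str]) -> Dict[str, List[float]]:
    """Store one vector per class: the mean of that class's embeddings (an assumption)."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def manhattan(a: Vector, b: Vector) -> float:
    """Manhattan (L1) distance: the sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def classify(vec: Vector, centers: Dict[str, List[float]]) -> str:
    """Return the label whose stored vector is closest to vec in Manhattan distance."""
    return min(centers, key=lambda label: manhattan(vec, centers[label]))


# Toy embeddings standing in for the output of the deep model.
train_vectors = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]]
train_labels = ["negative", "negative", "positive", "positive"]
centers = build_text_center(train_vectors, train_labels)

print(classify([1.5, 0.5], centers))
print(classify([9.0, 9.0], centers))
```

Under this assumption, classifying a new text costs one distance computation per stored vector, in addition to whatever is needed to produce its embedding, which is consistent with the short classification times the abstract reports.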


Pages: 125552-125563

Page count: 12
