Named entity recognition of agricultural based entity-level masking BERT and BiLSTM-CRF

Authors
Wei Z. [1]
Song L. [2, 3]
Hu X. [4]
Chen N. [1, 3]
Affiliations
[1] School of Computer and Electronics Information, Guangxi University, Nanning
[2] College of Information Engineering, Nanning University, Nanning
[3] Guangxi Key Laboratory of Multimedia Communications and Networks Technology, Nanning
[4] School of Information and Statistics, Guangxi University of Finance and Economics, Nanning
Keywords
agriculture; BERT; BiLSTM; CRF; entity-level masking; named entity recognition
DOI
10.11975/j.issn.1002-6819.2022.15.021
Abstract
Intelligent question answering over agricultural knowledge is an important part of information-based agriculture. Named entity recognition is a key technology for intelligent question answering and knowledge graph construction in the agricultural domain, both of which demand accurate identification of named entities. Chinese named entity recognition is constrained by the positional and semantic information of characters, because agricultural entities are long and have complex names. Improving recognition performance therefore requires sufficient capture of character position, contextual semantic features, and long-distance dependency information. In this study, a Chinese agricultural named entity recognition method based on an EmBERT-BiLSTM-CRF model was proposed. Firstly, the Bidirectional Encoder Representations from Transformers (BERT) pre-trained language model was applied as the word embedding layer. Pre-training deep bidirectional representations of word vectors improved the contextual semantic representation of the model and alleviated polysemy. Secondly, the masked language modelling of BERT was adapted to the characteristics of Chinese. An Entity-level Masking strategy was used to mask each Chinese entity in a sentence completely, as a run of consecutive tokens, which better represented Chinese semantics and alleviated the bias caused by incomplete semantics. Thirdly, a Bidirectional Long Short-Term Memory (BiLSTM) network was adopted to learn the semantic features of long sequences using two LSTM networks (forward and backward), so that contextual information in both directions was considered at the same time and the long-distance dependency information of the text was captured.
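The Entity-level Masking step described above can be illustrated with a small sketch. It is not the authors' implementation: the function name `entity_level_mask`, the `mask_prob` parameter, the half-open span format, and the example sentence are all assumptions made for illustration, and the abstract does not state the masking rate or replacement scheme actually used. The sketch only shows the core idea, that an entity is either masked in full or left untouched.

```python
import random

MASK = "[MASK]"


def entity_level_mask(tokens, entity_spans, mask_prob=0.15, rng=None):
    """Mask whole entity spans instead of individual characters.

    tokens:       list of characters (Chinese BERT tokenizes per character).
    entity_spans: list of (start, end) index pairs, end exclusive (assumed format).
    mask_prob:    probability of masking each entity (hypothetical value).
    Returns (masked_tokens, targets), where targets holds the original
    character at every masked position and None elsewhere.
    """
    rng = rng or random.Random()
    masked = list(tokens)
    targets = [None] * len(tokens)
    for start, end in entity_spans:
        # One random draw per entity, so the entity is masked all-or-nothing.
        if rng.random() < mask_prob:
            for i in range(start, end):
                targets[i] = tokens[i]
                masked[i] = MASK
    return masked, targets


# Hypothetical example: "水稻" (rice) and "稻瘟病" (rice blast) are entities.
sentence = list("水稻稻瘟病防治")
spans = [(0, 2), (2, 5)]
masked, targets = entity_level_mask(sentence, spans, mask_prob=1.0)
```

With character-level masking, only part of "稻瘟病" might be hidden, leaving fragments that leak the answer; masking the full span forces the model to predict the entity from its surrounding context.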
Finally, a Conditional Random Field (CRF) was used to learn the labelling constraints in the training data. The learned constraint rules were used to check whether a label sequence was legal during prediction, and the CRF used the information of adjacent labels to output the globally optimal label sequence. The output of the model was thus a sequence of mutually dependent labels, namely an optimal sequence that took the labelling rules and label order into account. A focal loss function was also used to alleviate the unbalanced sample distribution. A named entity recognition corpus was constructed for the experiments. After BIO labelling, the corpus contained a total of 29 790 agricultural entities, including 11 057 crops, 8 121 pesticides, 4 505 diseases, and 6 107 pest entities, and was divided into training, validation, and test sets at a ratio of 7:2:1. Four types of agricultural entities were identified and labelled in the text: crop varieties, pesticides, diseases, and insect pests. The experimental results show that the recognition accuracy of the EmBERT-BiLSTM-CRF model for the four types of entities was 94.97%, and the F1 score was 95.93%. Compared with the models based on BiLSTM-CRF and BERT-BiLSTM-CRF, the recognition performance of EmBERT-BiLSTM-CRF was significantly improved. This shows that using a pre-trained language model as the word embedding layer represents the characteristics of characters well, and that the Entity-level Masking strategy alleviates the bias caused by incomplete semantics, thereby enhancing the Chinese semantic representation ability of the model and enabling it to identify Chinese agricultural named entities more accurately.
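To make the notion of a "legal" label sequence concrete, the sketch below hand-codes the standard BIO rule that an inside tag `I-X` may only follow `B-X` or `I-X` of the same type. This is an illustration only: a CRF learns such constraints as transition scores from data rather than from an explicit rule, and the tag names `CROP` and `DIS` as well as the function name `is_legal_bio` are hypothetical, since the abstract does not list the tag set used.

```python
def is_legal_bio(labels):
    """Return True if a label sequence obeys the BIO scheme.

    Rule checked: an "I-X" label must be preceded by "B-X" or "I-X"
    with the same entity type X. "O" and "B-X" may appear anywhere.
    """
    prev = "O"  # treat the position before the sequence as outside any entity
    for label in labels:
        if label.startswith("I-"):
            entity_type = label[2:]
            if prev not in (f"B-{entity_type}", f"I-{entity_type}"):
                return False
        prev = label
    return True


# Hypothetical tag names: CROP for crops, DIS for diseases.
legal = is_legal_bio(["B-CROP", "I-CROP", "O", "B-DIS", "I-DIS", "I-DIS"])
illegal_start = is_legal_bio(["O", "I-CROP"])          # I- without a B-
illegal_switch = is_legal_bio(["B-CROP", "I-DIS"])     # type changes mid-entity
```

A per-token classifier on top of BiLSTM can emit sequences like the two illegal ones; decoding with a CRF scores whole sequences, so such transitions receive low scores and are avoided in the globally optimal output.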
This research not only provides relatively high entity recognition accuracy for tasks such as agricultural intelligent question answering, but also offers new ideas for the identification of Chinese named entities in the fishery, animal husbandry, Chinese medical, and biological fields. © 2022 Chinese Society of Agricultural Engineering. All rights reserved.
Pages: 195-203
Page count: 8