Question-answering Forestry Pre-trained Language Model: ForestBERT

Cited by: 0
Authors
Tan, Jingwei [1]
Zhang, Huaiqing [1]
Liu, Yang [1]
Yang, Jie [1,2]
Zheng, Dongping [3]
Affiliations
[1] Institute of Forest Resource Information Techniques, Chinese Academy of Forestry; Key Laboratory of Forestry Remote Sensing and Information System, National Forestry and Grassland Administration, Beijing 100091, China
[2] College of Forestry, Beijing Forestry University, Beijing 100083, China
[3] University of Hawai’i at Mānoa, Honolulu, HI 96822, United States
Source
Linye Kexue/Scientia Silvae Sinicae | 2024, Vol. 60, No. 09
Keywords
Decision making; Information management; Metadata; Natural language processing systems; Problem oriented languages; Self-supervised learning; Semantics; Supervised learning; Web crawler
DOI
10.11707/j.1001-7488.LYKX20240435
Abstract
【Objective】To address the low utilization of forestry texts, the insufficient grasp of forestry knowledge by general-domain pre-trained language models, and the time-consuming nature of data annotation, this study makes full use of massive forestry texts, proposes a pre-trained language model that integrates forestry domain knowledge, and efficiently realizes forestry extractive question answering by automatically annotating the training data, so as to provide intelligent information services for forestry decision-making and management.

【Method】First, a forestry corpus covering three topics (terminology, law, and literature) was constructed with web crawler technology and used to further pre-train the general-domain pre-trained language model BERT. Through self-supervised learning on the masked language model and next sentence prediction tasks, BERT effectively learned forestry semantic information, yielding ForestBERT, a pre-trained language model that captures the general features of forestry text. Subsequently, the pre-trained language model mT5 was fine-tuned to automatically label samples; after manual correction, a forestry extractive question-answering dataset of 2 280 samples across the three topics was constructed. On this dataset, six general-domain Chinese pre-trained language models (BERT, RoBERTa, MacBERT, PERT, ELECTRA, and LERT) and the ForestBERT model constructed in this study were trained and validated to assess the advantages of ForestBERT. To investigate the impact of different topics on model performance, all models were also fine-tuned separately on the forestry terminology, forestry law, and forestry literature subsets. In addition, the question-answering results of ForestBERT and BERT on forestry literature were compared visually to demonstrate the advantages of ForestBERT more intuitively.

【Result】ForestBERT outperformed the other six comparison models on the forestry extractive question-answering task. Compared with the base model BERT, ForestBERT improved the EM score and F1 score by 1.6% and 1.72%, respectively, and showed an average performance improvement of 0.96% over the other five models. Under the optimal data division ratio for each model, ForestBERT exceeded BERT by 2.12% and the other five models by 1.2% in EM score, and by 1.88% and 1.26% in F1 score. ForestBERT also excelled in all three forestry topics, with evaluation scores 3.06%, 1.73%, and 2.76% higher than those of the other five models on the terminology, law, and literature tasks, respectively. Across all models, performance was best on the terminology task, with an average F1 score of 87.63%, and lowest on the law task, which still reached 82.32%. In the literature extractive question-answering task, ForestBERT provided more accurate and comprehensive answers than BERT.

【Conclusion】Enhancing a general-domain pre-trained language model with forestry domain knowledge through further pre-training can effectively improve its accuracy on the forestry extractive question-answering task, and provides a new approach for processing and applying texts in forestry and other fields. © 2024 Chinese Society of Forestry. All rights reserved.
Pages: 99-110
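
The Method section above further pre-trains BERT on a crawled forestry corpus through the masked language model and next sentence prediction objectives. Below is a minimal sketch of that step, assuming the Hugging Face Transformers and Datasets libraries, a hypothetical plain-text corpus file forestry_corpus.txt, and illustrative hyperparameters rather than the paper's reported settings; for brevity it covers only the masked language model objective.

```python
# Minimal sketch of further pre-training on forestry text (MLM objective only).
# "forestry_corpus.txt" and the hyperparameters below are assumptions for
# illustration, not the paper's actual corpus file or training settings.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")

# Load the crawled forestry texts (terminology, law, literature), one passage per line.
corpus = load_dataset("text", data_files={"train": "forestry_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask tokens so the model learns forestry semantics by predicting them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="forestbert",         # hypothetical output directory
    per_device_train_batch_size=16,  # illustrative values only
    num_train_epochs=3,
    learning_rate=5e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()

model.save_pretrained("forestbert")
tokenizer.save_pretrained("forestbert")
```

A checkpoint produced this way only shifts the model toward forestry language; as in the paper, it still has to be fine-tuned on the extractive question-answering dataset before it can answer questions.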
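
The Result section reports EM and F1 scores for extractive question answering. The paper does not spell out its evaluation script, so the sketch below shows one common character-level formulation of these two span metrics for Chinese QA, offered as an assumption rather than the authors' implementation.

```python
# Character-level EM and F1 for extractive QA answer spans (illustrative).
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the predicted span equals the reference span exactly, else 0.0."""
    return float(prediction.strip() == reference.strip())

def f1_score(prediction: str, reference: str) -> float:
    """Character-overlap F1 between the predicted and reference answer spans."""
    pred_chars = list(prediction.strip())
    ref_chars = list(reference.strip())
    common = Counter(pred_chars) & Counter(ref_chars)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)

# A prediction covering only part of the reference answer scores 0 on EM
# but still receives partial credit on F1.
print(exact_match("森林抚育", "森林抚育间伐"))          # 0.0
print(round(f1_score("森林抚育", "森林抚育间伐"), 2))  # 0.8
```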