Question-answering Forestry Pre-trained Language Model: ForestBERT

Cited by: 0
Authors
Tan, Jingwei [1]
Zhang, Huaiqing [1]
Liu, Yang [1]
Yang, Jie [1,2]
Zheng, Dongping [3]
Affiliations
[1] Institute of Forest Resource Information Techniques, Chinese Academy of Forestry; Key Laboratory of Forestry Remote Sensing and Information System, National Forestry and Grassland Administration, Beijing 100091, China
[2] College of Forestry, Beijing Forestry University, Beijing 100083, China
[3] University of Hawai’i at Mānoa, Honolulu, HI 96822, United States
Source
Linye Kexue/Scientia Silvae Sinicae | 2024, Vol. 60, No. 09
Keywords
Decision making; Information management; Metadata; Natural language processing systems; Problem oriented languages; Self-supervised learning; Semantics; Supervised learning; Web crawler
DOI
10.11707/j.1001-7488.LYKX20240435
Abstract
【Objective】To address the low utilization of forestry texts, the insufficient grasp of forestry knowledge by general-domain pre-trained language models, and the time-consuming nature of data annotation, this study makes full use of massive forestry texts, proposes a pre-trained language model that integrates forestry domain knowledge, and efficiently realizes forestry extractive question answering by automatically annotating the training data, so as to provide intelligent information services for forestry decision-making and management.

【Method】First, a forestry corpus covering three topics (terminology, law, and literature) was constructed with web crawler technology and used to further pre-train the general-domain pre-trained language model BERT. Through self-supervised learning on the masked language model and next sentence prediction tasks, BERT effectively learned forestry semantic information, yielding ForestBERT, a pre-trained language model that captures the general features of forestry text. Subsequently, the pre-trained language model mT5 was fine-tuned to automatically label samples; after manual correction, a forestry extractive question-answering dataset of 2 280 samples across the three topics was constructed. On this dataset, six general-domain Chinese pre-trained language models (BERT, RoBERTa, MacBERT, PERT, ELECTRA, and LERT) and the ForestBERT model constructed in this study were trained and validated to assess the advantages of ForestBERT. To investigate the impact of different topics on model performance, all models were also fine-tuned separately on the forestry terminology, forestry law, and forestry literature subsets. In addition, the question-answering results of ForestBERT and BERT on forestry literature were compared visually to demonstrate the advantages of ForestBERT more intuitively.

【Result】ForestBERT outperformed the other six comparison models on the forestry extractive question-answering task. Compared with the base model BERT, ForestBERT improved the EM score and F1 score by 1.6% and 1.72%, respectively, and showed an average performance improvement of 0.96% over the other five models. Under the optimal data division ratio for each model, ForestBERT exceeded BERT by 2.12% and the other five models by 1.2% in EM score, and by 1.88% and 1.26% in F1 score. ForestBERT also excelled in all three forestry topics, with evaluation scores 3.06%, 1.73%, and 2.76% higher than those of the other five models on the terminology, law, and literature tasks, respectively. Across all models, performance was best on the terminology task, with an average F1 score of 87.63%, and lowest on the law task, which still reached 82.32%. In the literature extractive question-answering task, ForestBERT provided more accurate and comprehensive answers than BERT.

【Conclusion】Enhancing a general-domain pre-trained language model with forestry domain knowledge through further pre-training can effectively improve its accuracy on the forestry extractive question-answering task, and provides a new approach for processing and applying texts in forestry and other fields. © 2024 Chinese Society of Forestry. All rights reserved.
Pages: 99-110
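
The Method section above further pre-trains BERT on a crawled forestry corpus through the masked language model and next sentence prediction objectives. Below is a minimal sketch of that step, assuming the Hugging Face Transformers and Datasets libraries, a hypothetical plain-text corpus file forestry_corpus.txt, and illustrative hyperparameters rather than the paper's reported settings; for brevity it covers only the masked language model objective.

```python
# Minimal sketch of further pre-training on forestry text (MLM objective only).
# "forestry_corpus.txt" and the hyperparameters below are assumptions for
# illustration, not the paper's actual corpus file or training settings.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")

# Load the crawled forestry texts (terminology, law, literature), one passage per line.
corpus = load_dataset("text", data_files={"train": "forestry_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask tokens so the model learns forestry semantics by predicting them.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="forestbert",         # hypothetical output directory
    per_device_train_batch_size=16,  # illustrative values only
    num_train_epochs=3,
    learning_rate=5e-5,
)

Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    data_collator=collator,
).train()

model.save_pretrained("forestbert")
tokenizer.save_pretrained("forestbert")
```

A checkpoint produced this way only shifts the model toward forestry language; as in the paper, it still has to be fine-tuned on the extractive question-answering dataset before it can answer questions.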
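
The Result section reports EM and F1 scores for extractive question answering. The paper does not spell out its evaluation script, so the sketch below shows one common character-level formulation of these two span metrics for Chinese QA, offered as an assumption rather than the authors' implementation.

```python
# Character-level EM and F1 for extractive QA answer spans (illustrative).
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the predicted span equals the reference span exactly, else 0.0."""
    return float(prediction.strip() == reference.strip())

def f1_score(prediction: str, reference: str) -> float:
    """Character-overlap F1 between the predicted and reference answer spans."""
    pred_chars = list(prediction.strip())
    ref_chars = list(reference.strip())
    common = Counter(pred_chars) & Counter(ref_chars)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_chars)
    recall = overlap / len(ref_chars)
    return 2 * precision * recall / (precision + recall)

# A prediction covering only part of the reference answer scores 0 on EM
# but still receives partial credit on F1.
print(exact_match("森林抚育", "森林抚育间伐"))          # 0.0
print(round(f1_score("森林抚育", "森林抚育间伐"), 2))  # 0.8
```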