【Objective】To address the problems of the low utilization of forestry texts, the insufficient forestry knowledge of general-domain pre-trained language models, and the time-consuming nature of data annotation, this study makes full use of massive forestry texts, proposes a pre-trained language model that integrates forestry domain knowledge, and efficiently realizes forestry extractive question answering by automatically annotating the training data, so as to provide intelligent information services for forestry decision-making and management.【Method】First, a forestry corpus covering three topics (terminology, law, and literature) was constructed using web crawler technology. This corpus was used to further pre-train the general-domain pre-trained language model BERT. Through self-supervised learning on the masked language model and next sentence prediction tasks, BERT effectively learned forestry semantic information, yielding the pre-trained language model ForestBERT, which captures the general features of forestry text. Subsequently, the pre-trained language model mT5 was fine-tuned to automatically label samples. After manual correction, a forestry extractive question-answering dataset comprising 2 280 samples across the three topics was constructed. On this dataset, six general-domain Chinese pre-trained language models (BERT, RoBERTa, MacBERT, PERT, ELECTRA, and LERT), as well as ForestBERT constructed specifically in this study, were trained and validated to identify the advantages of ForestBERT. To investigate the impact of different topics on model performance, all models were fine-tuned on datasets for the three topics: forestry terminology, forestry law, and forestry literature. Additionally, the question-answering results of ForestBERT and BERT on forestry literature were compared visually to demonstrate the advantages of ForestBERT more intuitively.【Result】ForestBERT outperformed the six comparison models on the forestry extractive question-answering task. Compared with the base model BERT, ForestBERT improved the EM score and F1 score by 1.6% and 1.72%, respectively, and showed an average performance improvement of 0.96% over the other five models. Under the optimal dataset division ratio for each model, ForestBERT outperformed BERT and the other five models in EM score by 2.12% and 1.2%, respectively, and in F1 score by 1.88% and 1.26%. ForestBERT also excelled on all three forestry topics, scoring 3.06%, 1.73%, and 2.76% higher than the other five models on the terminology, law, and literature tasks, respectively. Across all models, performance was highest on the terminology task, with an average F1 score of 87.63%, and lowest on the law task, where the average F1 score still reached 82.32%. In the literature extractive question-answering task, ForestBERT provided more accurate and comprehensive answers than BERT.【Conclusion】Enhancing the forestry domain knowledge of a general pre-trained language model through further pre-training can effectively improve its accuracy on the forestry extractive question-answering task, and provides a new approach for processing and applying texts in forestry and other fields. © 2024 Chinese Society of Forestry. All rights reserved.
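As a rough illustration of the further pre-training step described in the Method section, the following minimal sketch continues training a general-domain Chinese BERT checkpoint on a forestry text file with the masked language model objective, using the Hugging Face Transformers and Datasets libraries (the next sentence prediction objective used in the paper is omitted here for brevity). The checkpoint name bert-base-chinese, the file forestry_corpus.txt, the output name forestbert, and all hyperparameters are illustrative assumptions, not the paper's actual configuration.

```python
from datasets import load_dataset
from transformers import (BertForMaskedLM, BertTokenizerFast,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

# Assumed general-domain Chinese BERT checkpoint to be further pre-trained.
model_name = "bert-base-chinese"
tokenizer = BertTokenizerFast.from_pretrained(model_name)
model = BertForMaskedLM.from_pretrained(model_name)

# forestry_corpus.txt is a hypothetical plain-text dump of the crawled
# terminology, law, and literature passages (one passage per line).
dataset = load_dataset("text", data_files={"train": "forestry_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset["train"].map(tokenize, batched=True,
                                 remove_columns=["text"])

# Dynamic token masking for the masked language model objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True,
                                           mlm_probability=0.15)

args = TrainingArguments(
    output_dir="forestbert",          # hypothetical output directory
    per_device_train_batch_size=16,   # illustrative hyperparameters
    num_train_epochs=3,
    learning_rate=5e-5,
    save_steps=10_000,
)

Trainer(model=model, args=args, train_dataset=tokenized,
        data_collator=collator).train()

# Save the further pre-trained weights for later fine-tuning.
model.save_pretrained("forestbert")
tokenizer.save_pretrained("forestbert")
```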
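The sketch below shows, again only as an assumed illustration, what extractive question answering looks like at inference time: the model selects an answer span verbatim from the given context. The checkpoint name forestbert-qa stands in for a ForestBERT model that has already been fine-tuned on the forestry question-answering dataset, and the question and context are invented examples rather than samples from the paper's dataset.

```python
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# "forestbert-qa" is a placeholder name for a ForestBERT checkpoint
# fine-tuned on the extractive question-answering dataset.
qa = pipeline(
    "question-answering",
    model=AutoModelForQuestionAnswering.from_pretrained("forestbert-qa"),
    tokenizer=AutoTokenizer.from_pretrained("forestbert-qa"),
)

# Extractive QA: the predicted answer is a span copied from the context.
result = qa(
    question="What are the main causes of the decline in forest coverage?",
    context=("The survey report notes that illegal logging and land "
             "conversion are the main causes of the decline in forest "
             "coverage in the region."),
)
print(result["answer"], result["score"])
```

Predictions of this form can then be scored against reference answers with the exact match (EM) and F1 metrics reported in the Result section.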