BB-GeoGPT: A framework for learning a large language model for geographic information science

Cited by: 12
Authors
Zhang, Yifan [1]
Wang, Zhiyun [1]
He, Zhengting [1]
Li, Jingxuan [1]
Mai, Gengchen [2, 3]
Lin, Jianfeng [4]
Wei, Cheng [1]
Yu, Wenhao [1, 5]
Affiliations
[1] China Univ Geosci, Sch Geog & Informat Engn, Wuhan 430078, Peoples R China
[2] Univ Texas Austin, Dept Geog & Environm, SEAI Lab, Austin, TX 78712 USA
[3] Univ Georgia, Dept Geog, SEAI Lab, Athens, GA 30602 USA
[4] Meituan, Beijing 100102, Peoples R China
[5] China Univ Geosci, Natl Engn Res Ctr Geog Informat Syst, Wuhan 430078, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Large language model; GIS knowledge corpus; Domain adaptation; Self-instruct instructions; DISAMBIGUATION
DOI
10.1016/j.ipm.2024.103808
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Large language models (LLMs) exhibit impressive capabilities across diverse natural language processing tasks. Nevertheless, challenges remain: large parameter counts and access limited to APIs (as with ChatGPT and GPT-4) prevent deployment on mobile devices and hinder domain adaptation or fine-tuning. Moreover, while LLMs excel in general domains, their performance in specialized fields such as GIS may not always meet the expectations of domain experts. This is primarily attributed to the diverse disciplinary origins of the training data, which often lack comprehensive coverage and treatment of knowledge specific to individual disciplines (e.g., GIS). There is therefore a crucial need to train and adapt LLMs for different professional fields. In this paper, we focus on the GIS domain and introduce BB(BaBy)-GeoGPT, a large language model with GIS-specific knowledge. To this end, we curated a comprehensive set of resources comprising model pretraining data (BB-GeoPT, 26,907 documents), supervised fine-tuning data (BB-GeoSFT, 35,876 instructions), and evaluation data (BB-GeoEval, 600 objective questions and 150 subjective questions). BB-GeoGPT is developed by first adapting an open-source general-domain LLM, the LLaMA-2-7B model, to our pretraining data. Subsequently, we use instruction tuning to further fine-tune the model on BB-GeoSFT. In extensive experiments on the evaluation dataset, BB-GeoGPT demonstrates accuracy improvements ranging from 10.55% to 47.57% for objective questions and from 7.87% to 27.73% for subjective questions, compared to general LLMs of similar size. Moreover, our data collection strategy and the amassed data can serve as a foundation for advancing LLM research in the GIS domain.
Pages: 19