Evaluating large language models on geospatial tasks: a multiple geospatial task benchmarking study

Cited by: 0
Authors
Xu, Liuchang [1, 2, 5]
Zhao, Shuo [1 ]
Lin, Qingming [1 ]
Chen, Luyao [1 ]
Luo, Qianqian [1 ]
Wu, Sensen [2 ]
Ye, Xinyue [3, 4]
Feng, Hailin [1 ]
Du, Zhenhong [2 ]
Affiliations
[1] Zhejiang Agr & Forestry Univ, Sch Math & Comp Sci, Hangzhou, Peoples R China
[2] Zhejiang Univ, Sch Earth Sci, Hangzhou 310058, Peoples R China
[3] Texas A&M Univ, Dept Landscape Architecture & Urban Planning, College Stn, TX USA
[4] Texas A&M Univ, Ctr Geospatial Sci Applicat & Technol, College Stn, TX USA
[5] Sunyard Technol Co Ltd, Financial Big Data Res Inst, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Large language models; ChatGPT; benchmarking; spatial reasoning; prompt engineering
DOI
10.1080/17538947.2025.2480268
Chinese Library Classification (CLC) number
P9 [Physical Geography]
Discipline classification code
0705; 070501
Abstract
The emergence of large language models like ChatGPT and Gemini has highlighted the need to assess their diverse capabilities. However, their performance on geospatial tasks remains underexplored. This study introduces a novel multi-task spatial evaluation dataset to address this gap, covering twelve task types, including spatial understanding and route planning, with verified answers. We evaluated several models, including OpenAI's gpt-3.5-turbo, gpt-4-turbo, gpt-4o, ZhipuAI's glm-4, Anthropic's claude-3-sonnet-20240229, and MoonShot's moonshot-v1-8k, using a two-phase testing approach: zero-shot testing followed by difficulty-based categorization and prompt tuning. Results show that gpt-4o had the highest overall accuracy in the first phase at 71.3%. Though moonshot-v1-8k performed slightly worse overall, it outperformed gpt-4o in place name recognition tasks. The study also highlights the impact of prompt strategies on performance, such as the Chain-of-Thought strategy, which boosted gpt-4o's accuracy in route planning from 12.4% to 87.5%, and a one-shot strategy that raised moonshot-v1-8k's accuracy in mapping tasks from 10.1% to 76.3%.
Pages: 32