Evaluating large language models on geospatial tasks: a multiple geospatial task benchmarking study

Cited: 0
Authors
Xu, Liuchang [1 ,2 ,5 ]
Zhao, Shuo [1 ]
Lin, Qingming [1 ]
Chen, Luyao [1 ]
Luo, Qianqian [1 ]
Wu, Sensen [2 ]
Ye, Xinyue [3 ,4 ]
Feng, Hailin [1 ]
Du, Zhenhong [2 ]
Affiliations
[1] Zhejiang Agr & Forestry Univ, Sch Math & Comp Sci, Hangzhou, Peoples R China
[2] Zhejiang Univ, Sch Earth Sci, Hangzhou 310058, Peoples R China
[3] Texas A&M Univ, Dept Landscape Architecture & Urban Planning, College Stn, TX USA
[4] Texas A&M Univ, Ctr Geospatial Sci Applicat & Technol, College Stn, TX USA
[5] Sunyard Technol Co Ltd, Financial Big Data Res Inst, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Large language models; ChatGPT; benchmarking; spatial reasoning; prompt engineering;
DOI
10.1080/17538947.2025.2480268
Chinese Library Classification (CLC)
P9 [Physical Geography];
Discipline codes
0705; 070501;
Abstract
The emergence of large language models such as ChatGPT and Gemini has highlighted the need to assess their diverse capabilities, yet their performance on geospatial tasks remains underexplored. This study introduces a novel multi-task spatial evaluation dataset to address this gap, covering twelve task types, including spatial understanding and route planning, each with verified answers. We evaluated several models, including OpenAI's gpt-3.5-turbo, gpt-4-turbo, and gpt-4o, ZhipuAI's glm-4, Anthropic's claude-3-sonnet-20240229, and MoonShot's moonshot-v1-8k, using a two-phase approach: zero-shot testing, followed by difficulty-based categorization and prompt tuning. In the first phase, gpt-4o achieved the highest overall accuracy at 71.3%. Although moonshot-v1-8k performed slightly worse overall, it outperformed gpt-4o on place-name recognition tasks. The study also highlights the impact of prompt strategies on performance: a Chain-of-Thought strategy boosted gpt-4o's accuracy on route planning from 12.4% to 87.5%, and a one-shot strategy raised moonshot-v1-8k's accuracy on mapping tasks from 10.1% to 76.3%.
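As a concrete illustration of the two-phase protocol described in the abstract, the sketch below contrasts a zero-shot query with a Chain-of-Thought variant. It assumes the OpenAI Python SDK; the `ask` helper and the route-planning item are hypothetical illustrations, not the authors' actual benchmark materials or prompts.

```python
# A minimal sketch of the two prompt strategies the abstract compares
# (zero-shot vs. Chain-of-Thought), using the OpenAI Python SDK.
# The benchmark item below is a hypothetical placeholder, not an item
# from the authors' dataset.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def ask(model: str, question: str, chain_of_thought: bool = False) -> str:
    """Send one benchmark question to a model under a given prompt strategy."""
    if chain_of_thought:
        # Chain-of-Thought: nudge the model to reason before answering.
        question += "\nThink through the problem step by step, then give the final answer."
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
        temperature=0,  # deterministic decoding for reproducible scoring
    )
    return response.choices[0].message.content

# Phase 1 (zero-shot) and phase 2 (prompt tuning) on one illustrative item.
item = ("Starting from the train station, plan the shortest route that "
        "visits the museum, the park, and the library.")
baseline = ask("gpt-4o", item)
tuned = ask("gpt-4o", item, chain_of_thought=True)
```

Answers from both phases would then be scored against the dataset's verified answers to measure the accuracy gains the abstract reports.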
Pages: 32