Evaluating large language models on geospatial tasks: a multiple geospatial task benchmarking study

Cited by: 0

Authors:

Xu, Liuchang ^{[1,2,5]}

Zhao, Shuo ^{[1]}

Lin, Qingming ^{[1]}

Chen, Luyao ^{[1]}

Luo, Qianqian ^{[1]}

Wu, Sensen ^{[2]}

Ye, Xinyue ^{[3,4]}

Feng, Hailin ^{[1]}

Du, Zhenhong ^{[2]}

Affiliations:

[1] Zhejiang Agr & Forestry Univ, Sch Math & Comp Sci, Hangzhou, Peoples R China

[2] Zhejiang Univ, Sch Earth Sci, Hangzhou 310058, Peoples R China

[3] Texas A&M Univ, Dept Landscape Architecture & Urban Planning, College Stn, TX USA

[4] Texas A&M Univ, Ctr Geospatial Sci Applicat & Technol, College Stn, TX USA

[5] Sunyard Technol Co Ltd, Financial Big Data Res Inst, Hangzhou, Peoples R China

Source:

INTERNATIONAL JOURNAL OF DIGITAL EARTH | 2025 / Vol. 18 / Issue 01

Funding:

National Natural Science Foundation of China;

Keywords:

Large language models; ChatGPT; benchmarking; spatial reasoning; prompt engineering;

DOI:

10.1080/17538947.2025.2480268

Chinese Library Classification (CLC) number:

P9 [Physical Geography];

Discipline classification code:

0705 ; 070501 ;

Abstract:

The emergence of large language models like ChatGPT and Gemini has highlighted the need to assess their diverse capabilities. However, their performance on geospatial tasks remains underexplored. This study introduces a novel multi-task spatial evaluation dataset to address this gap, covering twelve task types, including spatial understanding and route planning, with verified answers. We evaluated several models, including OpenAI's gpt-3.5-turbo, gpt-4-turbo, gpt-4o, ZhipuAI's glm-4, Anthropic's claude-3-sonnet-20240229, and MoonShot's moonshot-v1-8k, using a two-phase testing approach: zero-shot testing followed by difficulty-based categorization and prompt tuning. Results show that gpt-4o had the highest overall accuracy in the first phase at 71.3%. Though moonshot-v1-8k performed slightly worse overall, it outperformed gpt-4o in place name recognition tasks. The study also highlights the impact of prompt strategies on performance, such as the Chain-of-Thought strategy, which boosted gpt-4o's accuracy in route planning from 12.4% to 87.5%, and a one-shot strategy that raised moonshot-v1-8k's accuracy in mapping tasks from 10.1% to 76.3%.


Pages: 32
