Evaluating large language models on geospatial tasks: a multiple geospatial task benchmarking study

Cited by: 0

Authors:

Xu, Liuchang [1,2,5]

Zhao, Shuo [1]

Lin, Qingming [1]

Chen, Luyao [1]

Luo, Qianqian [1]

Wu, Sensen [2]

Ye, Xinyue [3,4]

Feng, Hailin [1]

Du, Zhenhong [2]

Affiliations:

[1] Zhejiang Agr & Forestry Univ, Sch Math & Comp Sci, Hangzhou, Peoples R China

[2] Zhejiang Univ, Sch Earth Sci, Hangzhou 310058, Peoples R China

[3] Texas A&M Univ, Dept Landscape Architecture & Urban Planning, College Stn, TX USA

[4] Texas A&M Univ, Ctr Geospatial Sci Applicat & Technol, College Stn, TX USA

[5] Sunyard Technol Co Ltd, Financial Big Data Res Inst, Hangzhou, Peoples R China

Source:

INTERNATIONAL JOURNAL OF DIGITAL EARTH | 2025 / Vol. 18 / No. 01

Funding:

National Natural Science Foundation of China;

Keywords:

Large language models; ChatGPT; benchmarking; spatial reasoning; prompt engineering;

DOI:

10.1080/17538947.2025.2480268

Chinese Library Classification (CLC) number:

P9 [Physical Geography];

Discipline classification code:

0705 ; 070501 ;

Abstract:

The emergence of large language models like ChatGPT and Gemini has highlighted the need to assess their diverse capabilities. However, their performance on geospatial tasks remains underexplored. This study introduces a novel multi-task spatial evaluation dataset to address this gap, covering twelve task types, including spatial understanding and route planning, with verified answers. We evaluated several models, including OpenAI's gpt-3.5-turbo, gpt-4-turbo, gpt-4o, ZhipuAI's glm-4, Anthropic's claude-3-sonnet-20240229, and MoonShot's moonshot-v1-8k, using a two-phase testing approach: zero-shot testing followed by difficulty-based categorization and prompt tuning. Results show that gpt-4o had the highest overall accuracy in the first phase at 71.3%. Though moonshot-v1-8k performed slightly worse overall, it outperformed gpt-4o in place name recognition tasks. The study also highlights the impact of prompt strategies on performance, such as the Chain-of-Thought strategy, which boosted gpt-4o's accuracy in route planning from 12.4% to 87.5%, and a one-shot strategy that raised moonshot-v1-8k's accuracy in mapping tasks from 10.1% to 76.3%.


Pages: 32

