Evaluating large language models on geospatial tasks: a multiple geospatial task benchmarking study

Times Cited: 0
Authors
Xu, Liuchang [1 ,2 ,5 ]
Zhao, Shuo [1 ]
Lin, Qingming [1 ]
Chen, Luyao [1 ]
Luo, Qianqian [1 ]
Wu, Sensen [2 ]
Ye, Xinyue [3 ,4 ]
Feng, Hailin [1 ]
Du, Zhenhong [2 ]
Affiliations
[1] Zhejiang Agr & Forestry Univ, Sch Math & Comp Sci, Hangzhou, Peoples R China
[2] Zhejiang Univ, Sch Earth Sci, Hangzhou 310058, Peoples R China
[3] Texas A&M Univ, Dept Landscape Architecture & Urban Planning, College Stn, TX USA
[4] Texas A&M Univ, Ctr Geospatial Sci Applicat & Technol, College Stn, TX USA
[5] Sunyard Technol Co Ltd, Financial Big Data Res Inst, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Large language models; ChatGPT; benchmarking; spatial reasoning; prompt engineering;
DOI
10.1080/17538947.2025.2480268
CLC Number (Chinese Library Classification)
P9 [Physical Geography];
Discipline Code
0705; 070501;
Abstract
The emergence of large language models such as ChatGPT and Gemini has highlighted the need to assess their diverse capabilities. However, their performance on geospatial tasks remains underexplored. To address this gap, this study introduces a novel multi-task spatial evaluation dataset covering twelve task types, including spatial understanding and route planning, each with verified answers. We evaluated several models, including OpenAI's gpt-3.5-turbo, gpt-4-turbo, and gpt-4o, ZhipuAI's glm-4, Anthropic's claude-3-sonnet-20240229, and MoonShot's moonshot-v1-8k, using a two-phase testing approach: zero-shot testing, followed by difficulty-based categorization and prompt tuning. In the first phase, gpt-4o achieved the highest overall accuracy at 71.3%. Although moonshot-v1-8k performed slightly worse overall, it outperformed gpt-4o on place-name recognition tasks. The study also highlights the impact of prompting strategies on performance: a Chain-of-Thought strategy boosted gpt-4o's route-planning accuracy from 12.4% to 87.5%, and a one-shot strategy raised moonshot-v1-8k's accuracy on mapping tasks from 10.1% to 76.3%.
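The abstract attributes much of the reported gain to prompting strategy rather than model choice. Below is a minimal sketch of how such a zero-shot versus Chain-of-Thought comparison could be run, assuming the OpenAI Python SDK; the route-planning question, prompt templates, and model name are illustrative assumptions, not items from the paper's actual benchmark.

# Minimal sketch: zero-shot vs. Chain-of-Thought prompting on a
# route-planning question, assuming the OpenAI Python SDK. The
# question and templates below are hypothetical placeholders, not
# items from the paper's benchmark dataset.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical route-planning item in the spirit of the benchmark.
QUESTION = (
    "Stops: B is 2 km east of A, C is 1 km north of B, and D is 3 km "
    "west of C. Starting at A and visiting every stop exactly once, "
    "which order minimises total travel distance? "
    "Reply with the stop order only."
)

ZERO_SHOT = "Answer the question directly.\n\n{q}"
CHAIN_OF_THOUGHT = (
    "Think step by step: compute the pairwise distances, compare the "
    "candidate visit orders, then give the best order on the final "
    "line.\n\n{q}"
)

def ask(model: str, template: str, question: str) -> str:
    """Send a single prompt and return the model's reply text."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": template.format(q=question)}],
        temperature=0,  # reduce sampling noise when benchmarking
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    for label, template in (("zero-shot", ZERO_SHOT),
                            ("chain-of-thought", CHAIN_OF_THOUGHT)):
        print(f"[{label}] {ask('gpt-4o', template, QUESTION)}")

In a full harness, each reply would be scored against the dataset's verified answer, and per-task accuracy aggregated across the twelve task types; that scoring step is omitted here.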
Pages: 32
Related Papers
50 records in total
  • [31] BLESS: Benchmarking Large Language Models on Sentence Simplification
    Kew, Tannon
    Chi, Alison
    Vasquez-Rodriguez, Laura
    Agrawal, Sweta
    Aumiller, Dennis
    Alva-Manchego, Fernando
    Shardlow, Matthew
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023, : 13291 - 13309
  • [32] IdSarcasm: Benchmarking and Evaluating Language Models for Indonesian Sarcasm Detection
    Suhartono, Derwin
    Wongso, Wilson
    Tri Handoyo, Alif
    IEEE ACCESS, 2024, 12 : 87323 - 87332
  • [33] TRAM: Benchmarking Temporal Reasoning for Large Language Models
    Wang, Yuqing
    Zhao, Yun
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 6389 - 6415
  • [34] Evaluating the capabilities of large language models using machine learning tasks at inference-time
    Grm, Klemen
Elektrotehniski Vestnik/Electrotechnical Review, 2023, 90 (05): 247 - 253
  • [35] Evaluating the capabilities of large language models using machine learning tasks at inference-time
    Grm, Klemen
ELEKTROTEHNISKI VESTNIK, 2023, 90 (05): 247 - 253
  • [36] Geospatial Monitoring and Structural Mechanics Models: a Case Study of Sports Structures
    Shults, Roman
    Soltabayeva, Saule
    Seitkazina, Gulnur
    Nukarbekova, Zhupargul
    Kucherenko, Oksana
    11TH INTERNATIONAL CONFERENCE ENVIRONMENTAL ENGINEERING (11TH ICEE), 2020,
  • [37] Adopting Pre-trained Large Language Models for Regional Language Tasks: A Case Study
    Gaikwad, Harsha
    Kiwelekar, Arvind
    Laddha, Manjushree
    Shahare, Shashank
    INTELLIGENT HUMAN COMPUTER INTERACTION, IHCI 2023, PT I, 2024, 14531 : 15 - 25
  • [38] Benchmarking Transformers-based models on French Spoken Language Understanding tasks
    Cattan, Oralie
    Ghannay, Sahar
    Servan, Christophe
    Rosset, Sophie
    INTERSPEECH 2022, 2022, : 1238 - 1242
  • [39] Evaluating Source Code Quality with Large Language Models: a comparative study
    da Silva Simões, Igor Regis
    Venson, Elaine
arXiv
  • [40] Can Large Language Models Replace Therapists? Evaluating Performance at Simple Cognitive Behavioral Therapy Tasks
    Hodson, Nathan
    Williamson, Simon
    JMIR AI, 2024, 3