Evaluating large language models on geospatial tasks: a multiple geospatial task benchmarking study

Cited by: 0
Authors
Xu, Liuchang [1 ,2 ,5 ]
Zhao, Shuo [1 ]
Lin, Qingming [1 ]
Chen, Luyao [1 ]
Luo, Qianqian [1 ]
Wu, Sensen [2 ]
Ye, Xinyue [3 ,4 ]
Feng, Hailin [1 ]
Du, Zhenhong [2 ]
Affiliations
[1] Zhejiang Agr & Forestry Univ, Sch Math & Comp Sci, Hangzhou, Peoples R China
[2] Zhejiang Univ, Sch Earth Sci, Hangzhou 310058, Peoples R China
[3] Texas A&M Univ, Dept Landscape Architecture & Urban Planning, College Stn, TX USA
[4] Texas A&M Univ, Ctr Geospatial Sci Applicat & Technol, College Stn, TX USA
[5] Sunyard Technol Co Ltd, Financial Big Data Res Inst, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Large language models; ChatGPT; benchmarking; spatial reasoning; prompt engineering;
DOI
10.1080/17538947.2025.2480268
Chinese Library Classification (CLC)
P9 [Physical Geography];
Discipline Codes
0705; 070501;
Abstract
The emergence of large language models like ChatGPT and Gemini has highlighted the need to assess their diverse capabilities. However, their performance on geospatial tasks remains underexplored. This study introduces a novel multi-task spatial evaluation dataset to address this gap, covering twelve task types, including spatial understanding and route planning, with verified answers. We evaluated several models, including OpenAI's gpt-3.5-turbo, gpt-4-turbo, gpt-4o, ZhipuAI's glm-4, Anthropic's claude-3-sonnet-20240229, and MoonShot's moonshot-v1-8k, using a two-phase testing approach: zero-shot testing followed by difficulty-based categorization and prompt tuning. Results show that gpt-4o had the highest overall accuracy in the first phase at 71.3%. Though moonshot-v1-8k performed slightly worse overall, it outperformed gpt-4o in place name recognition tasks. The study also highlights the impact of prompt strategies on performance, such as the Chain-of-Thought strategy, which boosted gpt-4o's accuracy in route planning from 12.4% to 87.5%, and a one-shot strategy that raised moonshot-v1-8k's accuracy in mapping tasks from 10.1% to 76.3%.
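The two-phase evaluation protocol summarized in the abstract (zero-shot testing, then prompt tuning with Chain-of-Thought and one-shot strategies against verified answers) can be sketched as follows. This is a minimal illustration: the prompt wording, function names, and scoring rule are assumptions for exposition, not the paper's actual materials.

```python
def zero_shot_prompt(question: str) -> str:
    """Phase 1: pose the geospatial question directly, with no examples."""
    return f"Question: {question}\nAnswer:"


def chain_of_thought_prompt(question: str) -> str:
    """Phase 2: ask the model to reason step by step (Chain-of-Thought),
    the strategy the study credits with raising gpt-4o's route-planning
    accuracy from 12.4% to 87.5%."""
    return (
        f"Question: {question}\n"
        "Let's think step by step, then state the final answer.\nAnswer:"
    )


def one_shot_prompt(question: str, example_q: str, example_a: str) -> str:
    """Phase 2 alternative: prepend one worked example (one-shot), the
    strategy credited with improving moonshot-v1-8k on mapping tasks."""
    return (
        f"Question: {example_q}\nAnswer: {example_a}\n\n"
        f"Question: {question}\nAnswer:"
    )


def accuracy(predictions: list[str], gold: list[str]) -> float:
    """Score model outputs against the dataset's verified answers by
    exact match (an assumed, simplified scoring rule)."""
    correct = sum(p.strip() == g.strip() for p, g in zip(predictions, gold))
    return correct / len(gold)
```

In this setup, each task category would be run first with `zero_shot_prompt`, then re-run with the tuned strategies on the difficulty-categorized subsets, comparing the resulting `accuracy` values per model.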
Pages: 32