Benchmarking Large Language Models: Opportunities and Challenges

Cited by: 0

Authors:

Hodak, Miro [1]

Ellison, David [2]

Van Buren, Chris [2]

Jiang, Xiaotong [2]

Dholakia, Ajay [2]

Affiliations:

[1] AMD, Data Ctr Solut Grp, Austin, TX 78735 USA

[2] Lenovo, Infrastruct Solut Grp, Morrisville, NC USA

Source:

PERFORMANCE EVALUATION AND BENCHMARKING, TPCTC 2023 | 2024 / Vol. 14247

Keywords:

Artificial Intelligence; Inference; Training; MLPerf; TPCx-AI; Deep Learning; Performance; Large Language Models;

DOI:

10.1007/978-3-031-68031-1_6

Chinese Library Classification:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

With the exponentially growing popularity of Large Language Models (LLMs) and LLM-based applications like ChatGPT and Bard, the Artificial Intelligence (AI) community of developers and users is in need of representative benchmarks to enable careful comparison across a variety of use cases. The set of metrics has grown beyond accuracy and throughput to include energy efficiency, bias, trust, and sustainability. This paper aims to provide an overview of popular LLMs from a benchmarking perspective. Key LLMs are described, and the associated datasets are characterized. A detailed discussion of benchmarking metrics covering training and inference stages is provided, and challenges in evaluating these metrics are highlighted. A review of recent performance and benchmark submissions is included, and emerging trends are summarized. The paper lays the foundation for developing new benchmarks to allow informed comparison of different AI systems based on combinations of models, datasets, and metrics.

Pages: 77-89

Page count: 13