Benchmarking Large Language Models: Opportunities and Challenges

Authors
Hodak, Miro [1]
Ellison, David [2]
Van Buren, Chris [2]
Jiang, Xiaotong [2]
Dholakia, Ajay [2]
Affiliations
[1] AMD, Data Ctr Solut Grp, Austin, TX 78735 USA
[2] Lenovo, Infrastruct Solut Grp, Morrisville, NC USA
Keywords
Artificial Intelligence; Inference; Training; MLPerf; TPCx-AI; Deep Learning; Performance; Large Language Models
DOI
10.1007/978-3-031-68031-1_6
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the exponentially growing popularity of Large Language Models (LLMs) and LLM-based applications like ChatGPT and Bard, the Artificial Intelligence (AI) community of developers and users needs representative benchmarks to enable careful comparison across a variety of use cases. The set of metrics has grown beyond accuracy and throughput to include energy efficiency, bias, trust, and sustainability. This paper aims to provide an overview of popular LLMs from a benchmarking perspective. Key LLMs are described, and the associated datasets are characterized. A detailed discussion of benchmarking metrics covering the training and inference stages is provided, and challenges in evaluating these metrics are highlighted. A review of recent performance and benchmark submissions is included, and emerging trends are summarized. The paper lays the foundation for developing new benchmarks to allow informed comparison of different AI systems based on combinations of models, datasets, and metrics.
Pages: 77-89
Page count: 13