Benchmarking Large Language Models on Controllable Generation under Diversified Instructions

Cited by: 0

Authors:

Chen, Yihan [1]

Xu, Benfeng [1]

Wang, Quan [2]

Liu, Yi [3]

Mao, Zhendong [1]

Affiliations:

[1] Univ Sci & Technol China, Hefei, Peoples R China

[2] Beijing Univ Posts & Telecommun, MOE Key Lab Trustworthy Distributed Comp & Serv, Beijing, Peoples R China

[3] State Key Lab Commun Content Cognit Peoples Daily, Beijing, Peoples R China

Source:

THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 16 | 2024

Funding:

National Natural Science Foundation of China

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

While large language models (LLMs) have exhibited impressive instruction-following capabilities, it is still unclear whether and to what extent they can respond to explicit constraints that might be entailed in various instructions. Because this is a significant aspect of LLM alignment, it is important to formulate such a specialized set of instructions and to investigate the resulting behavior of LLMs. To address this gap, we propose a new benchmark, CoDI-Eval, to systematically and comprehensively evaluate LLMs' responses to instructions with various constraints. We construct a large collection of constraint-attributed instructions as a test suite focused on both generalization and coverage. Specifically, we advocate an instruction diversification process to synthesize diverse forms of constraint expression, and we also carefully design the candidate task taxonomy with even finer-grained sub-categories. Finally, we automate the entire evaluation process to facilitate further developments. Unlike existing studies on controllable text generation, CoDI-Eval extends the scope to the prevalent instruction-following paradigm for the first time. We provide extensive evaluations of representative LLMs (e.g., ChatGPT, Vicuna) on CoDI-Eval, revealing their limitations in following instructions with specific constraints and showing that a significant gap remains between open-source and commercial closed-source LLMs. We believe this benchmark will facilitate research into improving the controllability of LLMs' responses to instructions. Our data and code are available at https://github.com/Xt-cyh/CoDI-Eval.

Pages: 17808-17816

Page count: 9
