Benchmarking Large Language Models on Controllable Generation under Diversified Instructions

Times Cited: 0
Authors
Chen, Yihan [1 ]
Xu, Benfeng [1 ]
Wang, Quan [2 ]
Liu, Yi [3 ]
Mao, Zhendong [1 ]
Affiliations
[1] University of Science and Technology of China, Hefei, China
[2] Beijing University of Posts and Telecommunications, MOE Key Laboratory of Trustworthy Distributed Computing and Service, Beijing, China
[3] State Key Laboratory of Communication Content Cognition, People's Daily, Beijing, China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Theory of Artificial Intelligence];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
While large language models (LLMs) have exhibited impressive instruction-following capabilities, it is still unclear whether, and to what extent, they can respond to explicit constraints that might be entailed in various instructions. As a significant aspect of LLM alignment, it is thus important to formulate such a specialized set of instructions and to investigate the resulting behavior of LLMs. To address this gap, we propose a new benchmark, CoDI-Eval, to systematically and comprehensively evaluate LLMs' responses to instructions with various constraints. We construct a large collection of constraint-attributed instructions as a test suite, focusing on both generalization and coverage. Specifically, we advocate an instruction diversification process to synthesize diverse forms of constraint expression, and we carefully design the candidate task taxonomy with finer-grained sub-categories. Finally, we automate the entire evaluation process to facilitate further development. Unlike existing studies on controllable text generation, CoDI-Eval extends the scope to the prevalent instruction-following paradigm for the first time. We provide extensive evaluations of representative LLMs (e.g., ChatGPT, Vicuna) on CoDI-Eval, revealing their limitations in following instructions with specific constraints, as well as a significant gap that remains between open-source and commercial closed-source LLMs. We believe this benchmark will facilitate research into improving the controllability of LLMs' responses to instructions. Our data and code are available at https://github.com/Xt-cyh/CoDI-Eval.
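The abstract states that the evaluation process is fully automated. As a rough illustration only, and not the benchmark's actual code, the following minimal Python sketch shows one way constraint-attributed instructions could be paired with programmatic verifiers and scored by constraint-following accuracy; every name here (ConstrainedInstruction, length_at_most, must_include, evaluate, echo_model) is hypothetical.

# Hypothetical sketch of an automated constraint-following evaluation loop
# in the spirit of CoDI-Eval; names are illustrative, not the benchmark's API.
from dataclasses import dataclass
from typing import Callable

@dataclass
class ConstrainedInstruction:
    prompt: str                    # a diversified instruction with a constraint
    check: Callable[[str], bool]   # programmatic verifier for that constraint

def length_at_most(n_words: int) -> Callable[[str], bool]:
    # Length constraint: the response may contain at most n_words words.
    return lambda response: len(response.split()) <= n_words

def must_include(keyword: str) -> Callable[[str], bool]:
    # Keyword constraint: the response must mention the keyword.
    return lambda response: keyword.lower() in response.lower()

def evaluate(model: Callable[[str], str],
             suite: list[ConstrainedInstruction]) -> float:
    # Accuracy = fraction of instructions whose constraint the response satisfies.
    passed = sum(item.check(model(item.prompt)) for item in suite)
    return passed / len(suite)

if __name__ == "__main__":
    suite = [
        ConstrainedInstruction(
            "Summarize the plot of Hamlet in at most 30 words.",
            length_at_most(30)),
        ConstrainedInstruction(
            "Write a product blurb for a phone that mentions 'battery life'.",
            must_include("battery life")),
    ]
    # Stand-in for a real LLM call, e.g. an API client wrapped as str -> str.
    echo_model = lambda prompt: "A short placeholder response about battery life."
    print(f"Constraint-following accuracy: {evaluate(echo_model, suite):.2f}")

In this framing, adding a new constraint category (sentiment, format, topic, etc.) only requires a new verifier function, which is what makes fully automatic scoring over a large, diversified test suite tractable.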
Pages: 17808-17816
Number of pages: 9