JAILBREAK ANTIDOTE: RUNTIME SAFETY-UTILITY BALANCE VIA SPARSE REPRESENTATION ADJUSTMENT IN LARGE LANGUAGE MODELS

Cited by: 0
Authors
Shen, Guobin [1, 2, 3, 4]
Zhao, Dongcheng [1, 2, 3]
Dong, Yiting [1, 2, 3, 4]
He, Xiang [1, 2, 3]
Zeng, Yi [1, 2, 3, 4]
Affiliations
[1] Brain-inspired Cognitive Intelligence Lab., Institute of Automation, Chinese Academy of Sciences, China
[2] Beijing Institute of AI Safety and Governance, China
[3] Center for Long-term Artificial Intelligence, China
[4] School of Future Technology, University of Chinese Academy of Sciences, China
Source
Keywords
Compendex;
DOI
Not available
Chinese Library Classification number
Subject classification code
Abstract
Related papers
9 in total
  • [1] Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing
    Zhao, Wei
    Li, Zhe
    Li, Yige
    Zhang, Ye
    Sun, Jun
    EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Findings of EMNLP 2024, 2024: 5094-5109
  • [2] Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing
    Zhao, Wei
    Li, Zhe
    Li, Yige
    Zhang, Ye
    Sun, Jun
    arXiv
  • [3] Tender: Accelerating Large Language Models via Tensor Decomposition and Runtime Requantization
    Lee, Jungi
    Lee, Wonbeom
    Sim, Jaewoong
    2024 ACM/IEEE 51ST ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, ISCA 2024, 2024: 1048-1062
  • [4] Overcoming language barriers via machine translation with sparse Mixture-of-Experts fusion of large language models
    Zhu, Shaolin
    Jian, Dong
    Xiong, Deyi
    INFORMATION PROCESSING & MANAGEMENT, 2025, 62 (03)
  • [5] CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
    Ren, Qibing
    Gao, Chang
    Shao, Jing
    Yan, Junchi
    Tan, Xin
    Lam, Wai
    Ma, Lizhuang
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024: 11437-11452
  • [6] Accelerating Sparse Autoencoder Training via Layer-Wise Transfer Learning in Large Language Models
    Ghilardi, Davide
    Belotti, Federico
    Molinari, Marco
    Lim, Jaehyuk
    BlackboxNLP 2024 - 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP - Proceedings of the Workshop, 2024: 530-550
  • [7] A safety realignment framework via subspace-oriented model fusion for large language models
    Yi, Xin
    Zheng, Shunfan
    Wang, Linlin
    Wang, Xiaoling
    He, Liang
    KNOWLEDGE-BASED SYSTEMS, 2024, 306
  • [8] Inverse design of high-performance piezoelectric semiconductors via advanced crystal representation and large language models
    Zhang, Chen
    Lv, Siyuan
    Gong, Haojie
    Cheng, Qianxi
    Guo, Junwei
    Zheng, Duanmu
    Xiao, Hang
    APPLIED PHYSICS LETTERS, 2025, 126 (11)
  • [9] Improving Diversity of Demographic Representation in Large Language Models via Collective-Critiques and Self-Voting
    Lahoti, Preethi
    Blumm, Nicholas
    Ma, Xiao
    Kotikalapudi, Raghavendra
    Potluri, Sahitya
    Tan, Qijun
    Srinivasan, Hansa
    Packer, Ben
    Beirami, Ahmad
    Beutel, Alex
    Chen, Jilin
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2023), 2023: 10383-10405