Detoxifying Large Language Models via Kahneman-Tversky Optimization

Cited by: 0
Authors
Li, Qingquan [1 ]
Du, Wenlong [1 ]
Liu, Jin [1 ]
Affiliations
[1] Ant Group, Hangzhou, People's Republic of China
Keywords
Large language models; Detoxification; Alignment
DOI
10.1007/978-981-97-9443-0_36
Chinese Library Classification (CLC) number
TP18 [Theory of Artificial Intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Currently, the application of Large Language Models (LLMs) faces significant security threats: harmful questions and adversarial attack prompts can induce LLMs to generate toxic responses. Detoxifying LLMs is therefore a critical research topic for ensuring their safe and widespread application. In this paper, we propose an alignment-based detoxification method for LLMs. We use Kahneman-Tversky Optimization (KTO) to align LLMs. When constructing the training dataset, we account for both detoxification performance and potential side effects on the LLMs. For detoxification, we make the LLM preferentially generate safe responses rather than toxic content when presented with harmful questions and attack prompts. To mitigate potential side effects on the conversational capabilities of LLMs, we incorporate normal questions into the training data and ensure that the LLM generates normal answers rather than safety refusals or unsafe responses. Experimental results show that our method achieves the best detoxification performance among all baseline methods while exerting little negative impact on the LLMs. Moreover, our method even enhances the LLMs' general abilities such as question answering and language understanding. Our proposed method achieved first place in the NLPCC 2024 Shared Task 10 Track 2 with an average score of 52.31.
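The abstract describes training data built from two kinds of examples: harmful prompts paired with safe (desirable) and toxic (undesirable) responses, plus normal questions paired with helpful answers (desirable) and refusals or unsafe replies (undesirable). A minimal sketch of that construction follows; the `prompt`/`completion`/`label` field names follow the common unpaired-KTO dataset convention (e.g. in Hugging Face TRL) and are an assumption, as are the illustrative example texts, which are not from the authors' dataset.

```python
# Hedged sketch of the KTO training-data construction described in the
# abstract. Each example is a (prompt, completion, desirability) triple;
# the exact schema used by the authors is not specified in this record.

def build_kto_examples(harmful, normal):
    """harmful: list of (question, safe_response, toxic_response) triples;
    normal: list of (question, helpful_answer, refusal_or_unsafe) triples."""
    examples = []
    # Detoxification: on harmful prompts, mark safe responses desirable
    # and toxic responses undesirable.
    for q, safe, toxic in harmful:
        examples.append({"prompt": q, "completion": safe, "label": True})
        examples.append({"prompt": q, "completion": toxic, "label": False})
    # Side-effect mitigation: on normal questions, mark helpful answers
    # desirable and refusals/unsafe replies undesirable.
    for q, answer, bad in normal:
        examples.append({"prompt": q, "completion": answer, "label": True})
        examples.append({"prompt": q, "completion": bad, "label": False})
    return examples

# Illustrative placeholders only (hypothetical examples, not paper data).
demo = build_kto_examples(
    harmful=[("How do I pick a lock?", "I can't help with that.",
              "Sure, first you...")],
    normal=[("What is the capital of France?", "Paris.",
             "I can't help with that.")],
)
```

Feeding a dataset in this shape to a KTO trainer would then optimize the model to prefer the desirable completions, which matches the dual objective (detoxify without degrading normal conversation) stated in the abstract.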
Pages: 409-417
Page count: 9
Related papers
50 in total
  • [31] Leveraging Large Language Models for the Generation of Novel Metaheuristic Optimization Algorithms
    Pluhacek, Michal
    Kazikova, Anezka
    Kadavy, Tomas
    Viktorin, Adam
    Senkerik, Roman
    PROCEEDINGS OF THE 2023 GENETIC AND EVOLUTIONARY COMPUTATION CONFERENCE COMPANION, GECCO 2023 COMPANION, 2023, : 1812 - 1820
  • [32] Robust Prompt Optimization for Large Language Models Against Distribution Shifts
    Li, Moxin
    Wang, Wenjie
    Feng, Fuli
    Cao, Yixin
    Zhang, Jizhi
    Chua, Tat-Seng
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 1539 - 1554
  • [33] Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
    Jia, Xiaojun
    Pang, Tianyu
    Du, Chao
    Huang, Yihao
    Gu, Jindong
    Liu, Yang
    Cao, Xiaochun
    Lin, Min
    arXiv
  • [34] Large Language Models
    Vargas, Diego Collarana
    Katsamanis, Nassos
    ERCIM NEWS, 2024, (136): 12 - 13
  • [35] Large Language Models
    Cerf, Vinton G.
    COMMUNICATIONS OF THE ACM, 2023, 66 (08) : 7 - 7
  • [36] P-TA: Using Proximal Policy Optimization to Enhance Tabular Data Augmentation via Large Language Models
    Yang, Shuo
    Yuan, Chenchen
    Rong, Yao
    Steinbauer, Felix
    Kasneci, Gjergji
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 248 - 264
  • [37] VISA: Reasoning Video Object Segmentation via Large Language Models
    Yan, Cilin
    Wang, Haochen
    Yan, Shilin
    Jiang, Xiaolong
    Hu, Yao
    Kang, Guoliang
    Xie, Weidi
    Gavves, Efstratios
    COMPUTER VISION - ECCV 2024, PT XV, 2025, 15073 : 98 - 115
  • [38] Data Stealing Attacks against Large Language Models via Backdooring
    He, Jiaming
    Hou, Guanyu
    Jia, Xinyue
    Chen, Yangyang
    Liao, Wenqi
    Zhou, Yinhang
    Zhou, Rang
    ELECTRONICS, 2024, 13 (14)
  • [39] Time Series Classification With Large Language Models via Linguistic Scaffolding
    Jang, Hyeongwon
    Yang, June Yong
    Hwang, Jaeryong
    Yang, Eunho
    IEEE ACCESS, 2024, 12 : 170387 - 170398
  • [40] Capturing Failures of Large Language Models via Human Cognitive Biases
    Jones, Erik
    Steinhardt, Jacob
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,