Detoxifying Large Language Models via Kahneman-Tversky Optimization

Cited by: 0
Authors
Li, Qingquan [1]
Du, Wenlong [1]
Liu, Jin [1]
Affiliations
[1] Ant Group, Hangzhou, People's Republic of China
Keywords
Large language models; Detoxification; Alignment
DOI
10.1007/978-981-97-9443-0_36
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The application of Large Language Models (LLMs) currently faces significant security threats. Harmful questions and adversarial attack prompts can induce LLMs to generate toxic responses. Detoxifying LLMs is therefore a critical research topic for ensuring their safe and widespread application. In this paper, we propose an alignment-based detoxification method for LLMs that uses Kahneman-Tversky Optimization (KTO) to align the model. When constructing the training dataset, we take into account both detoxification performance and potential side effects on the LLM. For detoxification, we train the LLM to preferentially generate safe responses rather than toxic content when given harmful questions and attack prompts. To mitigate potential side effects on the LLM's conversational capabilities, we incorporate normal questions into the training data and ensure that the LLM generates normal answers to them, rather than safety refusals or unsafe responses. Experimental results show that our method achieves the best detoxification performance compared with all baseline methods while exerting little negative impact on the LLMs. Moreover, our method even enhances the LLMs' general abilities, such as question answering and language understanding. Our proposed method achieved first place in NLPCC 2024 Shared Task 10, Track 2, with an average score of 52.31.
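The dataset construction described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the record type `KTOExample`, the helpers `label_harmful` and `label_normal`, and the placeholder strings are all hypothetical, and the only thing taken from the abstract is which kind of response counts as desirable for which kind of prompt. KTO trains on unpaired examples, each carrying a binary desirable/undesirable label, so every response becomes its own record.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class KTOExample:
    """One unpaired training record (hypothetical schema)."""
    prompt: str
    completion: str
    desirable: bool  # True = response to encourage, False = response to discourage


def label_harmful(prompt: str, safe: str, toxic: str) -> List[KTOExample]:
    # Harmful question or attack prompt: the safe response is desirable,
    # the toxic response is undesirable.
    return [KTOExample(prompt, safe, True), KTOExample(prompt, toxic, False)]


def label_normal(
    prompt: str,
    answer: str,
    refusal: Optional[str] = None,
    unsafe: Optional[str] = None,
) -> List[KTOExample]:
    # Normal question: the ordinary answer is desirable; an unnecessary safety
    # refusal or an unsafe response, when available, is undesirable.
    examples = [KTOExample(prompt, answer, True)]
    for bad in (refusal, unsafe):
        if bad is not None:
            examples.append(KTOExample(prompt, bad, False))
    return examples


# Placeholder strings stand in for real data.
examples = (
    label_harmful("<attack prompt>", "<safe response>", "<toxic response>")
    + label_normal("<normal question>", "<normal answer>", refusal="<safety refusal>")
)

for ex in examples:
    print(ex.desirable, ex.prompt, "->", ex.completion)
```

Including the normal questions is what the abstract credits with limiting side effects: without them, the model would only ever see refusal-style responses marked as desirable.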
Pages: 409-417
Number of pages: 9