Detoxifying Large Language Models via Kahneman-Tversky Optimization

Cited by: 0

Authors:

Li, Qingquan ^{[1]}

Du, Wenlong ^{[1]}

Liu, Jin ^{[1]}

Affiliations:

[1] Ant Group, Hangzhou, People's Republic of China

Source:

NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT V, NLPCC 2024 | 2025 / Vol. 15363

Keywords:

Large language models; Detoxification; Alignment;

DOI:

10.1007/978-981-97-9443-0_36

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Currently, the application of Large Language Models (LLMs) faces significant security threats. Harmful questions and adversarial attack prompts can induce LLMs to generate toxic responses. Therefore, detoxifying LLMs is a critical research topic for ensuring their safe and widespread application. In this paper, we propose an alignment-based detoxification method for LLMs that uses Kahneman-Tversky Optimization (KTO) to align the models. When constructing the training dataset, we take into account both detoxification performance and potential side effects on the LLMs. For detoxification, we train the LLM to preferentially generate safe responses rather than toxic content when given harmful questions and attack prompts. To mitigate potential side effects on the conversational capabilities of LLMs, we incorporate normal questions into the training data and ensure that the LLM generates normal answers rather than safety refusals or unsafe responses. Experimental results show that our method achieves the best detoxification performance compared with all baseline methods while exerting little negative impact on the LLMs. Moreover, our method even enhances the LLMs' general abilities, such as question answering and language understanding. Our proposed method achieved first place in the NLPCC 2024 Shared Task 10 Track 2 with an average score of 52.31.
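The dataset construction described in the abstract can be illustrated with a minimal sketch. The abstract does not give the authors' actual data format, so everything below is an assumption for illustration: the `KTOExample` record, the `build_kto_examples` helper, and the input keys (`prompt`, `safe`, `toxic`, `answer`, `refusal`, `unsafe`) are hypothetical names. The sketch only shows the labeling idea: KTO trains on single completions marked desirable or undesirable, so each prompt contributes one labeled record per completion.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class KTOExample:
    """One unpaired training record (hypothetical schema, not from the paper)."""
    prompt: str
    completion: str
    desirable: bool  # True: the model should prefer this completion


def build_kto_examples(
    harmful_items: List[Dict[str, str]],
    normal_items: List[Dict[str, str]],
) -> List[KTOExample]:
    """Label completions as described in the abstract.

    Harmful questions / attack prompts: safe response is desirable,
    toxic response is undesirable.
    Normal questions: normal answer is desirable, a safety refusal or an
    unsafe response is undesirable.
    """
    examples: List[KTOExample] = []
    for item in harmful_items:
        examples.append(KTOExample(item["prompt"], item["safe"], True))
        examples.append(KTOExample(item["prompt"], item["toxic"], False))
    for item in normal_items:
        examples.append(KTOExample(item["prompt"], item["answer"], True))
        # Either kind of undesirable completion may be absent for a prompt.
        for key in ("refusal", "unsafe"):
            if key in item:
                examples.append(KTOExample(item["prompt"], item[key], False))
    return examples


# Placeholder strings stand in for real prompts and responses.
harmful = [{"prompt": "<attack prompt>", "safe": "<safe reply>", "toxic": "<toxic reply>"}]
normal = [{"prompt": "<normal question>", "answer": "<normal answer>", "refusal": "<refusal>"}]
dataset = build_kto_examples(harmful, normal)
```

Including the normal-question records is what the abstract credits with limiting over-refusal: without them, the only desirable completions the model sees would be safety-oriented replies.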

Pages: 409-417

Page count: 9
