Detoxifying Large Language Models via Kahneman-Tversky Optimization

Cited by: 0
Authors
Li, Qingquan [1 ]
Du, Wenlong [1 ]
Liu, Jin [1 ]
Affiliations
[1] Ant Group, Hangzhou, People's Republic of China
Keywords
Large language models; Detoxification; Alignment
DOI
10.1007/978-981-97-9443-0_36
Chinese Library Classification (CLC) number
TP18 [Theory of Artificial Intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Currently, the application of Large Language Models (LLMs) faces significant security threats: harmful questions and adversarial attack prompts can induce LLMs to generate toxic responses. Detoxifying LLMs is therefore a critical research topic for ensuring their safe and widespread application. In this paper, we propose an alignment-based detoxification method for LLMs. We use Kahneman-Tversky Optimization (KTO) to align LLMs. When constructing the training dataset, we account for both detoxification performance and potential side effects on the LLMs. For detoxification, we make the LLM preferentially generate safe responses rather than toxic content when presented with harmful questions and attack prompts. To mitigate potential side effects on the conversational capabilities of LLMs, we incorporate normal questions into the training data and ensure that the LLM generates normal answers rather than safety refusals or unsafe responses. Experimental results show that our method achieves the best detoxification performance among all baseline methods while exerting little negative impact on the LLMs. Moreover, our method even enhances the LLMs' general abilities such as question answering and language understanding. Our proposed method achieved first place in the NLPCC 2024 Shared Task 10 Track 2 with an average score of 52.31.
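The abstract describes training data built from two kinds of examples: harmful prompts paired with safe (desirable) and toxic (undesirable) responses, plus normal questions paired with helpful answers (desirable) and refusals or unsafe replies (undesirable). A minimal sketch of that construction follows; the `prompt`/`completion`/`label` field names follow the common unpaired-KTO dataset convention (e.g. in Hugging Face TRL) and are an assumption, as are the illustrative example texts, which are not from the authors' dataset.

```python
# Hedged sketch of the KTO training-data construction described in the
# abstract. Each example is a (prompt, completion, desirability) triple;
# the exact schema used by the authors is not specified in this record.

def build_kto_examples(harmful, normal):
    """harmful: list of (question, safe_response, toxic_response) triples;
    normal: list of (question, helpful_answer, refusal_or_unsafe) triples."""
    examples = []
    # Detoxification: on harmful prompts, mark safe responses desirable
    # and toxic responses undesirable.
    for q, safe, toxic in harmful:
        examples.append({"prompt": q, "completion": safe, "label": True})
        examples.append({"prompt": q, "completion": toxic, "label": False})
    # Side-effect mitigation: on normal questions, mark helpful answers
    # desirable and refusals/unsafe replies undesirable.
    for q, answer, bad in normal:
        examples.append({"prompt": q, "completion": answer, "label": True})
        examples.append({"prompt": q, "completion": bad, "label": False})
    return examples

# Illustrative placeholders only (hypothetical examples, not paper data).
demo = build_kto_examples(
    harmful=[("How do I pick a lock?", "I can't help with that.",
              "Sure, first you...")],
    normal=[("What is the capital of France?", "Paris.",
             "I can't help with that.")],
)
```

Feeding a dataset in this shape to a KTO trainer would then optimize the model to prefer the desirable completions, which matches the dual objective (detoxify without degrading normal conversation) stated in the abstract.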
Pages: 409-417
Page count: 9
Related papers
50 in total
  • [31] Leveraging Large Language Models for the Generation of Novel Metaheuristic Optimization Algorithms
    Pluhacek, Michal
    Kazikova, Anezka
    Kadavy, Tomas
    Viktorin, Adam
    Senkerik, Roman
    PROCEEDINGS OF THE 2023 GENETIC AND EVOLUTIONARY COMPUTATION CONFERENCE COMPANION, GECCO 2023 COMPANION, 2023, : 1812 - 1820
  • [32] Robust Prompt Optimization for Large Language Models Against Distribution Shifts
    Li, Moxin
    Wang, Wenjie
    Feng, Fuli
    Cao, Yixin
    Zhang, Jizhi
    Chua, Tat-Seng
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 1539 - 1554
  • [33] Improved Techniques for Optimization-Based Jailbreaking on Large Language Models
    Jia, Xiaojun
    Pang, Tianyu
    Du, Chao
    Huang, Yihao
    Gu, Jindong
    Liu, Yang
    Cao, Xiaochun
    Lin, Min
    arXiv
  • [34] Large Language Models
    Vargas, Diego Collarana
    Katsamanis, Nassos
    ERCIM NEWS, 2024, (136): 12 - 13
  • [35] Large Language Models
    Cerf, Vinton G.
    COMMUNICATIONS OF THE ACM, 2023, 66 (08) : 7 - 7
  • [36] P-TA: Using Proximal Policy Optimization to Enhance Tabular Data Augmentation via Large Language Models
    Yang, Shuo
    Yuan, Chenchen
    Rong, Yao
    Steinbauer, Felix
    Kasneci, Gjergji
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 248 - 264
  • [37] VISA: Reasoning Video Object Segmentation via Large Language Models
    Yan, Cilin
    Wang, Haochen
    Yan, Shilin
    Jiang, Xiaolong
    Hu, Yao
    Kang, Guoliang
    Xie, Weidi
    Gavves, Efstratios
    COMPUTER VISION - ECCV 2024, PT XV, 2025, 15073 : 98 - 115
  • [38] Data Stealing Attacks against Large Language Models via Backdooring
    He, Jiaming
    Hou, Guanyu
    Jia, Xinyue
    Chen, Yangyang
    Liao, Wenqi
    Zhou, Yinhang
    Zhou, Rang
    ELECTRONICS, 2024, 13 (14)
  • [39] Time Series Classification With Large Language Models via Linguistic Scaffolding
    Jang, Hyeongwon
    Yang, June Yong
    Hwang, Jaeryong
    Yang, Eunho
    IEEE ACCESS, 2024, 12 : 170387 - 170398
  • [40] Capturing Failures of Large Language Models via Human Cognitive Biases
    Jones, Erik
    Steinhardt, Jacob
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,