Detoxifying Large Language Models via Kahneman-Tversky Optimization

Citations: 0

Authors:

Li, Qingquan [1]

Du, Wenlong [1]

Liu, Jin [1]

Affiliations:

[1] Ant Grp, Hangzhou, Peoples R China

Source:

NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT V, NLPCC 2024 | 2025 / Vol. 15363

Keywords:

Large language models; Detoxification; Alignment;

DOI:

10.1007/978-981-97-9443-0_36

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

The application of Large Language Models (LLMs) currently faces significant security threats: harmful questions and adversarial attack prompts can induce LLMs to generate toxic responses. Detoxifying LLMs is therefore a critical research topic for ensuring their safe and widespread application. In this paper, we propose an alignment-based detoxification method for LLMs that uses Kahneman-Tversky Optimization (KTO) to align the model. When constructing the training dataset, we take into account both the detoxification performance and the potential side effects on the LLM. For detoxification, we train the LLM to preferentially generate safe responses rather than toxic content when given harmful questions and attack prompts. To mitigate potential side effects on the conversational capabilities of the LLM, we incorporate normal questions into the training data and ensure that the LLM generates normal answers rather than safety refusals or unsafe responses. Experimental results show that our method achieves the best detoxification performance among all baseline methods while exerting little negative impact on the LLMs. Moreover, our method even enhances the LLMs' general abilities, such as question answering and language understanding. Our proposed method achieved first place in the NLPCC 2024 Shared Task 10 Track 2 with an average score of 52.31.
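
The dataset construction described in the abstract can be pictured with a minimal sketch. It assumes the unpaired format commonly used for KTO, in which each record is a prompt, a completion, and a boolean desirability label. The names `KTOExample` and `build_kto_examples` are hypothetical, the strings are placeholders, and the actual data pipeline, mixing ratio, and training setup used by the authors are not given in this record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KTOExample:
    """One unpaired KTO record (hypothetical structure, not from the paper)."""
    prompt: str
    completion: str
    label: bool  # True = desirable completion, False = undesirable completion


def build_kto_examples(harmful, normal):
    """Turn two kinds of source triples into labeled KTO records.

    harmful: iterable of (prompt, safe_response, toxic_response)
    normal:  iterable of (prompt, normal_answer, rejected_response), where the
             rejected response is a needless safety refusal or an unsafe reply
    """
    examples = []
    # Harmful questions / attack prompts: the safe response is desirable,
    # the toxic response is undesirable.
    for prompt, safe, toxic in harmful:
        examples.append(KTOExample(prompt, safe, True))
        examples.append(KTOExample(prompt, toxic, False))
    # Normal questions: the ordinary answer is desirable, a refusal or an
    # unsafe response is undesirable (to limit side effects on helpfulness).
    for prompt, answer, rejected in normal:
        examples.append(KTOExample(prompt, answer, True))
        examples.append(KTOExample(prompt, rejected, False))
    return examples


# Placeholder data for illustration only.
demo = build_kto_examples(
    harmful=[("<harmful question>", "<safe response>", "<toxic response>")],
    normal=[("<normal question>", "<normal answer>", "<safety refusal>")],
)
```

Because KTO works with per-example desirability labels rather than paired preferences, the desirable and undesirable completions in this sketch are stored as separate records.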


Pages: 409-417

Number of pages: 9
