Detoxifying Large Language Models via Kahneman-Tversky Optimization

Cited by: 0
Authors
Li, Qingquan [1]
Du, Wenlong [1]
Liu, Jin [1]
Affiliations
[1] Ant Group, Hangzhou, People's Republic of China
Keywords
Large language models; Detoxification; Alignment;
DOI
10.1007/978-981-97-9443-0_36
Chinese Library Classification number
TP18 [Artificial Intelligence Theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Currently, the application of Large Language Models (LLMs) faces significant security threats. Harmful questions and adversarial attack prompts can induce LLMs to generate toxic responses. Detoxifying LLMs is therefore a critical research topic for ensuring their safe and widespread application. In this paper, we propose an alignment-based detoxification method for LLMs that uses Kahneman-Tversky Optimization (KTO) to align the model. When constructing the training dataset, we take into account both detoxification performance and potential side effects on the LLM. For detoxification, we train the LLM to preferentially generate safe responses rather than toxic content when given harmful questions and attack prompts. To mitigate potential side effects on the LLM's conversational capabilities, we incorporate normal questions into the training data and ensure that the LLM generates normal answers rather than safety refusals or unsafe responses. Experimental results show that our method outperforms all baseline methods in detoxification while exerting little negative impact on the LLMs. Moreover, our method even enhances the LLMs' general abilities, such as question answering and language understanding. Our proposed method achieved first place in the NLPCC 2024 Shared Task 10 Track 2 with an average score of 52.31.
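The dataset construction described in the abstract can be illustrated with a minimal sketch. It assumes an unpaired layout of (prompt, completion, binary label), which is how KTO training data is commonly organized. The class `KTOExample`, the helper `build_examples`, and the placeholder strings are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class KTOExample:
    """One unpaired training example (hypothetical layout, not from the paper)."""
    prompt: str
    completion: str
    label: bool  # True = desirable completion, False = undesirable completion


def build_examples(prompt: str, good: List[str], bad: List[str]) -> List[KTOExample]:
    """Label every completion in `good` as desirable and every one in `bad` as undesirable."""
    examples = [KTOExample(prompt, c, True) for c in good]
    examples += [KTOExample(prompt, c, False) for c in bad]
    return examples


# Detoxification data: for a harmful question or attack prompt,
# a safe response is desirable and a toxic response is undesirable.
harmful = build_examples(
    "<harmful question or attack prompt>",
    good=["<safe response>"],
    bad=["<toxic response>"],
)

# Side-effect mitigation data: for a normal question, a normal answer is
# desirable, while a safety refusal or an unsafe response is undesirable.
normal = build_examples(
    "<normal question>",
    good=["<normal answer>"],
    bad=["<safety refusal>", "<unsafe response>"],
)

dataset = harmful + normal
```

Marking safety refusals to normal questions as undesirable is what discourages over-refusal in this sketch; the actual data sources and mixing ratios used by the authors are not stated in the abstract.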
Pages: 409-417
Page count: 9