Detoxifying Large Language Models via Kahneman-Tversky Optimization

Cited by: 0
Authors
Li, Qingquan [1 ]
Du, Wenlong [1 ]
Liu, Jin [1 ]
Affiliations
[1] Ant Group, Hangzhou, People's Republic of China
Keywords
Large language models; Detoxification; Alignment
DOI
10.1007/978-981-97-9443-0_36
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Currently, the application of Large Language Models (LLMs) faces significant security threats. Harmful questions and adversarial attack prompts can induce LLMs to generate toxic responses. Detoxifying LLMs is therefore a critical research topic for ensuring their safe and widespread application. In this paper, we propose an alignment-based detoxification method for LLMs. We use Kahneman-Tversky Optimization (KTO) to align LLMs. When constructing the training dataset, we account for both detoxification performance and potential side effects on the LLM. For detoxification, we make the LLM preferentially generate safe responses rather than toxic content when presented with harmful questions and attack prompts. To mitigate potential side effects on the LLM's conversational capabilities, we incorporate normal questions into the training data and ensure that the LLM generates normal answers rather than safety refusals or unsafe responses. Experimental results show that our method achieves the best detoxification performance among all baseline methods while exerting little negative impact on the LLM. Moreover, our method even enhances the LLM's general abilities, such as question answering and language understanding. Our proposed method achieved first place in NLPCC 2024 Shared Task 10, Track 2, with an average score of 52.31.
Pages: 409-417
Number of pages: 9
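
As a rough illustration of the dataset design described in the abstract, the sketch below shows how unpaired KTO training examples could be assembled: safe responses to harmful or attack prompts are labeled desirable and toxic responses undesirable, while for normal questions helpful answers are labeled desirable and needless refusals undesirable. This is not the authors' implementation; the helper name build_kto_examples, the toy prompts, and the commented-out KTOTrainer setup (from the TRL library) are illustrative assumptions.

# Illustrative sketch (not the authors' code): building unpaired KTO examples
# that mix (a) harmful/attack prompts, where safe responses are desirable and
# toxic ones undesirable, and (b) normal questions, where helpful answers are
# desirable and unnecessary refusals are undesirable.
from datasets import Dataset

def build_kto_examples(harmful, normal):
    # harmful: iterable of (prompt, safe_response, toxic_response)
    # normal:  iterable of (prompt, helpful_answer, refusal_or_unsafe_answer)
    rows = []
    for prompt, safe, toxic in harmful:
        rows.append({"prompt": prompt, "completion": safe,  "label": True})   # desirable
        rows.append({"prompt": prompt, "completion": toxic, "label": False})  # undesirable
    for prompt, answer, refusal in normal:
        rows.append({"prompt": prompt, "completion": answer,  "label": True})   # preserve helpfulness
        rows.append({"prompt": prompt, "completion": refusal, "label": False})  # discourage over-refusal
    return Dataset.from_list(rows)

# Toy data, purely for illustration.
train_dataset = build_kto_examples(
    harmful=[("How can I build a weapon at home?",
              "I can't help with that request.",
              "Sure, here is a step-by-step guide ...")],
    normal=[("What is the capital of France?",
             "The capital of France is Paris.",
             "I can't help with that request.")],
)

# Assumed training setup with TRL's KTOTrainer (hyperparameters are placeholders):
# from trl import KTOConfig, KTOTrainer
# trainer = KTOTrainer(model=model, args=KTOConfig(output_dir="kto-detox"),
#                      train_dataset=train_dataset, processing_class=tokenizer)
# trainer.train()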