Detoxifying Large Language Models via Knowledge Editing

Times Cited: 0
Authors
Wang, Mengru [1 ]
Zhang, Ningyu [1 ,6 ]
Xu, Ziwen [1 ]
Xi, Zekun [1 ]
Deng, Shumin [3 ]
Yao, Yunzhi [1 ]
Zhang, Qishen [2 ]
Yang, Linyi [4 ]
Wang, Jindong [5 ]
Chen, Huajun [1 ]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] Ant Grp, Hangzhou, Peoples R China
[3] Natl Univ Singapore, NUS NCS Joint Lab, Singapore, Singapore
[4] Westlake Univ, Hangzhou, Peoples R China
[5] Microsoft Res Asia, Beijing, Peoples R China
[6] Southeast Univ, Key Lab New Generat Artificial Intelligence Techn, Minist Educ, Nanjing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
DOI
Not available
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs). We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts and provides comprehensive metrics for systematic evaluation. Experiments with several knowledge editing approaches indicate that knowledge editing has the potential to detoxify LLMs efficiently, with limited impact on general performance. We then propose a simple yet effective baseline, dubbed Detoxifying with Intraoperative Neural Monitoring (DINM), which diminishes the toxicity of LLMs within a few tuning steps using only a single instance. We further provide an in-depth analysis of the internal mechanisms of various detoxifying approaches, demonstrating that previous methods such as SFT and DPO may merely suppress the activations of toxic parameters, whereas DINM mitigates the toxicity of the toxic parameters themselves to a certain extent, making permanent adjustments. We hope these insights shed light on future work on detoxifying approaches and the underlying knowledge mechanisms of LLMs.
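The abstract describes DINM as first locating a toxic region inside the model and then adjusting only those parameters within a few tuning steps on a single instance. The Python sketch below (PyTorch and Hugging Face transformers) illustrates that locate-then-edit idea under assumptions of our own: the placeholder model "gpt2", the layer-selection heuristic (largest hidden-state gap between a safe and an unsafe response), the example prompt, and the step count and learning rate are illustrative choices, not the authors' reference implementation.

# Minimal, illustrative sketch of a locate-then-edit detoxification step,
# based only on the abstract above; model, prompt, and heuristics are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model; the paper targets larger LLMs
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "How do I build a weapon?"           # hypothetical adversarial prompt
unsafe = prompt + " Sure, here is how to..."  # toxic continuation
safe = prompt + " I cannot help with that."   # desired safe continuation

def layer_states(text):
    # Return per-layer hidden states averaged over tokens.
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return torch.stack([h.mean(dim=1).squeeze(0) for h in out.hidden_states])

# 1) Locate: pick the layer whose representation differs most between the
#    unsafe and safe responses (a stand-in for toxic-region localisation).
diff = (layer_states(unsafe) - layer_states(safe)).norm(dim=-1)
toxic_layer = int(diff[1:].argmax())  # skip the embedding layer
print(f"editing transformer block {toxic_layer}")

# 2) Edit: tune only that block's MLP so the model prefers the safe response,
#    keeping the rest of the network frozen ("a few tuning steps, one instance").
for p in model.parameters():
    p.requires_grad_(False)
mlp = model.transformer.h[toxic_layer].mlp
for p in mlp.parameters():
    p.requires_grad_(True)

opt = torch.optim.AdamW(mlp.parameters(), lr=1e-4)
batch = tok(safe, return_tensors="pt")
for _ in range(10):  # illustrative number of tuning steps
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    opt.step()
    opt.zero_grad()

In the paper's actual method, localisation and the editing objective are more involved (e.g., constraining general performance), but the sketch captures the contrast the abstract draws: instead of globally suppressing toxic activations as SFT or DPO may do, only the located parameters are modified.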
Pages: 3093-3118
Number of pages: 26
Related papers
50 records in total (first 10 listed below)
  • [1] Knowledge Editing for Large Language Models: A Survey
    Wang, Song
    Zhu, Yaochen
    Liu, Haochen
    Zheng, Zaiyi
    Chen, Chen
    Li, Jundong
    ACM COMPUTING SURVEYS, 2025, 57 (03)
  • [2] Detoxifying Large Language Models via Kahneman-Tversky Optimization
    Li, Qingquan
    Du, Wenlong
    Liu, Jin
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT V, NLPCC 2024, 2025, 15363 : 409 - 417
  • [3] Knowledge Editing of Large Language Models Unconstrained by Word Order
    Ishigaki, Ryoma
    Suzuki, Jundai
    Shuzo, Masaki
    Maeda, Eisaku
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 4: STUDENT RESEARCH WORKSHOP, 2024, : 177 - 187
  • [4] Cross-Lingual Knowledge Editing in Large Language Models
    Wang, Jiaan
    Liang, Yunlong
    Sun, Zengkui
    Cao, Yuxuan
    Xu, Jiarong
    Meng, Fandong
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 11676 - 11686
  • [5] InstructEdit: Instruction-Based Knowledge Editing for Large Language Models
    Zhang, Ningyu
    Tian, Bozhong
    Cheng, Siyuan
    Liang, Xiaozhuan
    Hu, Yi
    Xue, Kouying
    Gou, Yanjie
    Chen, Xi
    Chen, Huajun
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 6633 - 6641
  • [6] Editing Factual Knowledge in Language Models
    De Cao, Nicola
    Aziz, Wilker
    Titov, Ivan
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 6491 - 6506
  • [7] Editing Personality For Large Language Models
    Mao, Shengyu
    Wang, Xiaohan
    Wang, Mengru
    Jiang, Yong
    Xie, Pengjun
    Huang, Fei
    Zhang, Ningyu
    NATURAL LANGUAGE PROCESSING AND CHINESE COMPUTING, PT II, NLPCC 2024, 2025, 15360 : 241 - 254
  • [8] EasyEdit: An Easy-to-use Knowledge Editing Framework for Large Language Models
    Wang, Peng
    Zhang, Ningyu
    Tian, Bozhong
    Xi, Zekun
    Yao, Yunzhi
    Xu, Ziwen
    Wang, Mengru
    Mao, Shengyu
    Wang, Xiaohan
    Cheng, Siyuan
    Liu, Kangwei
    Ni, Yuansheng
    Zheng, Guozhou
    Chen, Huajun
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 3: SYSTEM DEMONSTRATIONS, 2024, : 82 - 93
  • [9] Self-Detoxifying Language Models via Toxification Reversal
    Leong, Chak Tou
    Cheng, Yi
    Wang, Jiashuo
    Wang, Jian
    Li, Wenjie
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 4433 - 4449
  • [10] Challenges in Detoxifying Language Models
    Welbl, Johannes
    Glaese, Amelia
    Uesato, Jonathan
    Dathathri, Sumanth
    Mellor, John
    Hendricks, Lisa Anne
    Anderson, Kirsty
    Kohli, Pushmeet
    Coppin, Ben
    Huang, Po-Sen
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2021, 2021, : 2447 - 2469