Detoxifying Large Language Models via Knowledge Editing

Cited by: 0
Authors
Wang, Mengru [1 ]
Zhang, Ningyu [1 ,6 ]
Xu, Ziwen [1 ]
Xi, Zekun [1 ]
Deng, Shumin [3 ]
Yao, Yunzhi [1 ]
Zhang, Qishen [2 ]
Yang, Linyi [4 ]
Wang, Jindong [5 ]
Chen, Huajun [1 ]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] Ant Grp, Hangzhou, Peoples R China
[3] Natl Univ Singapore, NUS NCS Joint Lab, Singapore, Singapore
[4] Westlake Univ, Hangzhou, Peoples R China
[5] Microsoft Res Asia, Beijing, Peoples R China
[6] Southeast Univ, Key Lab New Generat Artificial Intelligence Techn, Minist Educ, Nanjing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
DOI: Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
This paper investigates using knowledge editing techniques to detoxify Large Language Models (LLMs). We construct a benchmark, SafeEdit, which covers nine unsafe categories with various powerful attack prompts and is equipped with comprehensive metrics for systematic evaluation. We conduct experiments with several knowledge editing approaches, indicating that knowledge editing has the potential to efficiently detoxify LLMs with limited impact on general performance. We then propose a simple yet effective baseline, dubbed Detoxifying with Intraoperative Neural Monitoring (DINM), which diminishes the toxicity of LLMs within a few tuning steps using only a single instance. We further provide an in-depth analysis of the internal mechanisms of various detoxifying approaches, demonstrating that previous methods such as SFT and DPO may merely suppress the activations of toxic parameters, whereas DINM mitigates the toxicity of the toxic parameters themselves to a certain extent, making permanent adjustments. We hope these insights will shed light on future work on detoxifying approaches and the underlying knowledge mechanisms of LLMs.
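The abstract only outlines how DINM works, so the following is a minimal, hypothetical Python sketch of the general idea it describes: locate the transformer layer whose hidden states diverge most between a safe and an unsafe response, then tune only that layer's parameters on a single instance. It assumes a Hugging Face-style causal LM; the model name, the locate_toxic_layer heuristic, and all hyperparameters are illustrative and are not the authors' released implementation.

    # Illustrative sketch of a DINM-style edit, NOT the authors' code.
    # Per the abstract: locate toxic parameters, then adjust them
    # permanently with a few tuning steps on one instance.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "gpt2"  # small stand-in; the paper edits larger instruction-tuned LLMs
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    def locate_toxic_layer(prompt, safe, unsafe):
        # Hypothetical localization rule: pick the transformer block whose
        # mean hidden state differs most between safe and unsafe outputs.
        with torch.no_grad():
            hs = model(**tok(prompt + safe, return_tensors="pt"),
                       output_hidden_states=True).hidden_states
            hu = model(**tok(prompt + unsafe, return_tensors="pt"),
                       output_hidden_states=True).hidden_states
        gaps = [(s.mean(1) - u.mean(1)).norm().item()
                for s, u in zip(hs[1:], hu[1:])]  # skip embedding output
        return max(range(len(gaps)), key=gaps.__getitem__)

    def detoxify_one_instance(prompt, safe, unsafe, steps=20, lr=1e-4):
        layer = locate_toxic_layer(prompt, safe, unsafe)
        for p in model.parameters():          # freeze the whole model ...
            p.requires_grad_(False)
        block = model.transformer.h[layer]    # GPT-2 naming; varies by model
        for p in block.parameters():          # ... except the located block
            p.requires_grad_(True)
        opt = torch.optim.Adam(block.parameters(), lr=lr)
        batch = tok(prompt + safe, return_tensors="pt")
        for _ in range(steps):                # a few steps on one instance
            loss = model(**batch, labels=batch["input_ids"]).loss
            opt.zero_grad(); loss.backward(); opt.step()
        # The paper also constrains general ability during tuning;
        # that regularizer is omitted from this sketch.

This contrasts with the suppression-style baselines the abstract mentions: SFT and DPO push the model away from toxic outputs across many parameters, whereas an edit like this directly changes the weights of the located block.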
Pages: 3093 - 3118
Page count: 26
Related Papers
50 in total
  • [21] Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing
    Zhao, Wei
    Li, Zhe
    Li, Yige
    Zhang, Ye
    Sun, Jun
arXiv preprint
  • [22] Knowledge Graph-Enhanced Large Language Models via Path Selection
    Liu, Haochen
    Wang, Song
    Zhu, Yaochen
    Dong, Yushun
    Li, Jundong
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 6311 - 6321
  • [23] Large Language Models as a Service: optimisation strategies via Knowledge Space reduction
    Panagoulias, Dimitrios P.
    Virvou, Maria
    Tsihrintzis, George A.
    2024 IEEE INTERNATIONAL CONFERENCE ON OMNI-LAYER INTELLIGENT SYSTEMS, COINS 2024, 2024, : 67 - 70
  • [24] History Matters: Temporal Knowledge Editing in Large Language Model
    Yin, Xunjian
    Jiang, Jin
    Yang, Liming
    Wan, Xiaojun
    THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 17, 2024, : 19413 - 19421
  • [25] Mitigating Privacy Seesaw in Large Language Models: Augmented Privacy Neuron Editing via Activation Patching
    Wu, Xinwei
    Dong, Weilong
    Xu, Shaoyang
    Xiong, Deyi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 5319 - 5332
  • [26] Quantifying Domain Knowledge in Large Language Models
    Sayenju, Sudhashree
    Aygun, Ramazan
    Franks, Bill
    Johnston, Sereres
    Lee, George
    Choi, Hansook
    Modgil, Girish
    2023 IEEE CONFERENCE ON ARTIFICIAL INTELLIGENCE, CAI, 2023, : 193 - 194
  • [27] Knowledge management in organization and the large language models
    Zelenkov, Yu. A.
    ROSSIISKII ZHURNAL MENEDZHMENTA-RUSSIAN MANAGEMENT JOURNAL, 2024, 22 (03): : 573 - 601
  • [28] Large language models encode clinical knowledge
    Singhal, Karan
    Azizi, Shekoofeh
    Tu, Tao
    Mahdavi, S. Sara
    Wei, Jason
    Chung, Hyung Won
    Scales, Nathan
    Tanwani, Ajay
    Cole-Lewis, Heather
    Pfohl, Stephen
    Payne, Perry
    Seneviratne, Martin
    Gamble, Paul
    Kelly, Chris
    Babiker, Abubakr
    Schaerli, Nathanael
    Chowdhery, Aakanksha
    Mansfield, Philip
    Demner-Fushman, Dina
    Arcas, Blaise Aguera y
    Webster, Dale
    Corrado, Greg S.
    Matias, Yossi
    Chou, Katherine
    Gottweis, Juraj
    Tomasev, Nenad
    Liu, Yun
    Rajkomar, Alvin
    Barral, Joelle
    Semturs, Christopher
    Karthikesalingam, Alan
    Natarajan, Vivek
NATURE, 2023, 620 (7972) : 172 - 180
  • [29] Debiasing Large Language Models with Structured Knowledge
    Ma, Congda
    Zhao, Tianyu
    Okumura, Manabu
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 10274 - 10287