Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination

Cited: 0
Authors
Yang, Nakyeong [1 ,2 ]
Kang, Taegwan [2 ]
Choi, Jungkyu [2 ]
Lee, Honglak [2 ,3 ]
Jung, Kyomin [1 ]
Affiliations
[1] Seoul Natl Univ, Seoul, South Korea
[2] LG AI Res, Seoul, South Korea
[3] Univ Michigan, Ann Arbor, MI 48109 USA
Funding
National Research Foundation of Singapore;
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Instruction-following language models often exhibit undesirable biases. These biases may be amplified in real-world usage, where a wide range of instructions is supplied through zero-shot prompting. To address this problem, we first define the bias neuron, which significantly affects biased outputs, and demonstrate its existence empirically. We then propose a novel and practical bias mitigation method, CRISPR, which eliminates bias neurons of language models in instruction-following settings. CRISPR automatically identifies biased outputs and, using an explainability method, categorizes the neurons that influence those outputs as bias neurons. Experimental results demonstrate the effectiveness of our method in mitigating biases under zero-shot instruction-following settings without degrading the model's task performance or existing knowledge. The results also show the generalizability of the method, which remains robust across various instructions and datasets. Surprisingly, our method can mitigate bias by eliminating only a few neurons (as few as three).
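The core loop the abstract describes (score each neuron's contribution to the biased output versus the correct output, then zero out the top-k) can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the attribution values, function names, and the simple score difference used for ranking are all assumptions for demonstration purposes.

```python
def find_bias_neurons(bias_attr, task_attr, k=3):
    # Score each neuron by how much more it contributes to the biased
    # output than to the correct task output (attribution difference),
    # then return the indices of the k highest-scoring "bias neurons".
    scores = [b - t for b, t in zip(bias_attr, task_attr)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

def eliminate(activations, bias_idx):
    # "Eliminate" bias neurons by masking their activations to zero.
    drop = set(bias_idx)
    return [0.0 if i in drop else a for i, a in enumerate(activations)]

# Toy attribution scores for 8 hidden neurons (illustrative values only).
bias_attr = [0.1, 0.9, 0.2, 0.8, 0.1, 0.0, 0.7, 0.1]
task_attr = [0.1, 0.1, 0.2, 0.1, 0.1, 0.0, 0.6, 0.1]

bias_idx = find_bias_neurons(bias_attr, task_attr, k=3)
masked = eliminate([1.0] * 8, bias_idx)
```

In this toy setup, neurons 1, 3, and 6 contribute disproportionately to the biased output and are masked, mirroring the paper's finding that eliminating only a few neurons can already mitigate bias.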
Pages: 9061-9073
Page count: 13
Related Papers
40 items in total
  • [31] Mitigating Privacy Seesaw in Large Language Models: Augmented Privacy Neuron Editing via Activation Patching
    Wu, Xinwei
    Dong, Weilong
    Xu, Shaoyang
    Xiong, Deyi
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 5319 - 5332
  • [32] CoGenesis: A Framework Collaborating Large and Small Language Models for Secure Context-Aware Instruction Following
    Zhang, Kaiyan
    Wang, Jianyu
    Hua, Ermo
    Qi, Biqing
    Ding, Ning
    Zhou, Bowen
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 4295 - 4312
  • [33] Mitigating Cognitive Biases in Clinical Decision-Making Through Multi-Agent Conversations Using Large Language Models: Simulation Study
    Ke, Yuhe
    Yang, Rui
    Lie, Sui An
    Lim, Taylor Xin Yi
    Ning, Yilin
    Li, Irene
    Abdullah, Hairil Rizal
    Ting, Daniel Shu Wei
    Liu, Nan
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26
  • [34] Co2PT: Mitigating Bias in Pre-trained Language Models through Counterfactual Contrastive Prompt Tuning
    Dong, Xiangjue
    Zhu, Ziwei
    Wang, Zhuoer
    Teleki, Maria
    Caverlee, James
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 5859 - 5871
  • [35] Mitigating bias in artificial intelligence: Fair data generation via causal models for transparent and explainable decision-making
    Gonzalez-Sendino, Ruben
    Serrano, Emilio
    Bajo, Javier
    FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2024, 155 : 384 - 401
  • [36] InstructGraph: Boosting Large Language Models via Graph-centric Instruction Tuning and Preference Alignment
    Wang, Jianing
    Wu, Junda
    Hou, Yupeng
    Liu, Yao
    Gao, Ming
    McAuley, Julian
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 13492 - 13510
  • [37] Mitigating Hallucinations in Large Language Models via Semantic Enrichment of Prompts: Insights from BioBERT and Ontological Integration
    Penkov, Stanislav
    PROCEEDINGS OF THE SIXTH INTERNATIONAL CONFERENCE COMPUTATIONAL LINGUISTICS IN BULGARIA, CLIB 2024, 2024, : 272 - 276
  • [38] Mitigating Demographic Bias of Federated Learning Models via Robust-Fair Domain Smoothing: A Domain-Shifting Approach
    Zeng, Huimin
    Yue, Zhenrui
    Jiang, Qian
    Zhang, Yang
    Shang, Lanyu
    Zong, Ruohan
    Wang, Dong
    2024 IEEE 44TH INTERNATIONAL CONFERENCE ON DISTRIBUTED COMPUTING SYSTEMS, ICDCS 2024, 2024, : 785 - 796
  • [39] Empowering Legal Citation Recommendation via Efficient Instruction-Tuning of Pre-trained Language Models
    Wang, Jie
    Bansal, Kanha
    Arapakis, Ioannis
    Ge, Xuri
    Jose, Joemon M.
    ADVANCES IN INFORMATION RETRIEVAL, ECIR 2024, PT I, 2024, 14608 : 310 - 324
  • [40] Empowering Cross-lingual Abilities of Instruction-tuned Large Language Models by Translation-following Demonstrations
    Ranaldi, Leonardo
    Pucci, Giulia
    Freitas, Andre
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 7961 - 7973