Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination

Cited by: 0
Authors
Yang, Nakyeong [1 ,2 ]
Kang, Taegwan [2 ]
Choi, Jungkyu [2 ]
Lee, Honglak [2 ,3 ]
Jung, Kyomin [1 ]
Affiliations
[1] Seoul Natl Univ, Seoul, South Korea
[2] LG AI Res, Seoul, South Korea
[3] Univ Michigan, Ann Arbor, MI 48109 USA
Funding
National Research Foundation, Singapore
Keywords
DOI
Not available
CLC number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Instruction-following language models often exhibit undesirable biases. These biases may be amplified in real-world usage, where a wide range of instructions is applied through zero-shot prompting. To address this problem, we first define the bias neuron, which significantly affects biased outputs, and demonstrate its existence empirically. Furthermore, we propose a novel and practical bias mitigation method, CRISPR, which eliminates bias neurons of language models in instruction-following settings. CRISPR automatically determines biased outputs and, using an explainability method, categorizes the neurons that affect those outputs as bias neurons. Experimental results demonstrate the effectiveness of our method in mitigating biases under zero-shot instruction-following settings without degrading the model's task performance or existing knowledge. The results also show the generalizability of our method, which remains robust across various instructions and datasets. Surprisingly, our method can mitigate bias in language models by eliminating only a few neurons (at least three).
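The abstract only sketches the procedure (attribute neurons to biased outputs, then eliminate the top-scoring ones). The following is a minimal illustrative sketch of that idea, not the authors' CRISPR implementation: the toy model, the gradient-times-activation attribution score, and all function names are assumptions made for illustration.

import torch
import torch.nn as nn

class ToyClassifier(nn.Module):
    """Tiny stand-in for a language model layer whose hidden neurons we attribute."""
    def __init__(self, d_in=16, d_hidden=32, n_classes=3):
        super().__init__()
        self.up = nn.Linear(d_in, d_hidden)      # produces the "neurons" to score
        self.down = nn.Linear(d_hidden, n_classes)

    def forward(self, x):
        self.hidden = torch.relu(self.up(x))     # cache activations for attribution
        self.hidden.retain_grad()
        return self.down(self.hidden)

def bias_neuron_scores(model, x, biased_class):
    """Gradient-times-activation attribution of each hidden neuron toward the
    logit of a biased output class (assumed here to be known in advance)."""
    logits = model(x)
    logits[:, biased_class].sum().backward()
    return (model.hidden * model.hidden.grad).abs().sum(dim=0)

def eliminate_bias_neurons(model, scores, k=3):
    """Zero the down-projection columns of the k highest-scoring neurons,
    removing their contribution to the output (the abstract reports that
    a handful of such neurons can matter)."""
    top = torch.topk(scores, k).indices
    with torch.no_grad():
        model.down.weight[:, top] = 0.0

model = ToyClassifier()
x = torch.randn(8, 16)
scores = bias_neuron_scores(model, x, biased_class=0)
eliminate_bias_neurons(model, scores, k=3)

In the paper's setting the biased outputs are determined automatically and the attribution uses an explainability method over an instruction-tuned language model; the sketch above only mirrors the score-and-prune structure.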
Pages: 9061-9073
Page count: 13