Mitigating Biases for Instruction-following Language Models via Bias Neurons Elimination

Cited by: 0
Authors
Yang, Nakyeong [1 ,2 ]
Kang, Taegwan [2 ]
Choi, Jungkyu [2 ]
Lee, Honglak [2 ,3 ]
Jung, Kyomin [1 ]
Affiliations
[1] Seoul Natl Univ, Seoul, South Korea
[2] LG AI Res, Seoul, South Korea
[3] Univ Michigan, Ann Arbor, MI 48109 USA
Funding
National Research Foundation, Singapore;
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Instruction-following language models often show undesirable biases. These undesirable biases may be accelerated in the real-world usage of language models, where a wide range of instructions is used through zero-shot example prompting. To solve this problem, we first define the bias neuron, which significantly affects biased outputs, and prove its existence empirically. Furthermore, we propose a novel and practical bias mitigation method, CRISPR, to eliminate bias neurons of language models in instruction-following settings. CRISPR automatically determines biased outputs and categorizes neurons that affect the biased outputs as bias neurons using an explainability method. Experimental results demonstrate the effectiveness of our method in mitigating biases under zero-shot instruction-following settings without losing the model's task performance and existing knowledge. The experimental results reveal the generalizability of our method as it shows robustness under various instructions and datasets. Surprisingly, our method can mitigate the bias in language models by eliminating only a few neurons (at least three).
Pages: 9061-9073
Page count: 13
Related papers
40 in total
  • [21] Towards Mitigating Hallucination in Large Language Models via Self-Reflection
    Ji, Ziwei
    Yu, Tiezheng
    Xu, Yan
    Lee, Nayeon
    Ishii, Etsuko
    Fung, Pascale
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS - EMNLP 2023, 2023, : 1827 - 1843
  • [22] Bias Out-of-the-Box: An Empirical Analysis of Intersectional Occupational Biases in Popular Generative Language Models
    Kirk, Hannah Rose
    Jun, Yennie
    Iqbal, Haider
    Benussi, Elias
    Volpin, Filippo
    Dreyer, Frederic A.
    Shtedritski, Aleksandar
    Asano, Yuki M.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [23] Advancing entity recognition in biomedicine via instruction tuning of large language models
    Keloth, Vipina K.
    Hu, Yan
    Xie, Qianqian
    Peng, Xueqing
    Wang, Yan
    Zheng, Andrew
    Selek, Melih
    Raja, Kalpana
    Wei, Chih Hsuan
    Jin, Qiao
    Lu, Zhiyong
    Chen, Qingyu
    Xu, Hua
    BIOINFORMATICS, 2024, 40 (04)
  • [24] Language Models Get a Gender Makeover: Mitigating Gender Bias with Few-Shot Data Interventions
    Thakur, Himanshu
    Jain, Atishay
    Vaddamanu, Praneetha
    Liang, Paul Pu
    Morency, Louis-Philippe
    61ST CONFERENCE OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 2, 2023, : 340 - 351
  • [25] Mitigating spatial hallucination in large language models for path planning via prompt engineering
    Zhang, Hongjie
    Deng, Hourui
    Ou, Jie
    Feng, Chaosheng
    SCIENTIFIC REPORTS, 2025, 15 (01)
  • [26] Assessing Inherent Biases Following Prompt Compression of Large Language Models for Game Story Generation
    Taveekitworachai, Pittawat
    Plupattanakit, Kantinan
    Thawonmas, Ruck
    2024 IEEE CONFERENCE ON GAMES, COG 2024, 2024
  • [27] Subtle Biases Need Subtler Measures: Dual Metrics for Evaluating Representative and Affinity Bias in Large Language Models
    Kumar, Abhishek
    Yunusov, Sarfaroz
    Emami, Ali
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 1: LONG PAPERS, 2024, : 375 - 392
  • [28] Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training
    Guo, Qingyan
    Wang, Rui
    Guo, Junliang
    Tan, Xu
    Bian, Jiang
    Yang, Yujiu
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: ACL 2024, 2024, : 11453 - 11464
  • [29] Mitigating Hallucination in Visual-Language Models via Re-balancing Contrastive Decoding
    Liang, Xiaoyu
    Yu, Jiayuan
    Mu, Lianrui
    Zhuang, Jiedong
    Hu, Jiaqi
    Yang, Yuchen
    Ye, Jiangnan
    Lu, Lu
    Chen, Jian
    Hu, Haoji
    PATTERN RECOGNITION AND COMPUTER VISION, PT V, PRCV 2024, 2025, 15035 : 482 - 496
  • [30] CommonIT: Commonality-Aware Instruction Tuning for Large Language Models via Data Partitions
    Rao, Jun
    Liu, Xuebo
    Lian, Lian
    Cheng, Shengjun
    Liao, Yunjie
    Zhang, Min
    EMNLP 2024 - 2024 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference, 2024, : 10064 - 10083