Interpreting Pretrained Language Models via Concept Bottlenecks

Cited by: 0
Authors
Tan, Zhen [1 ]
Cheng, Lu [2 ]
Wang, Song [3 ]
Yuan, Bo [4 ]
Li, Jundong [3 ]
Liu, Huan [1 ]
Affiliations
[1] Arizona State Univ, Tempe, AZ 85281 USA
[2] Univ Illinois, Chicago, IL USA
[3] Univ Virginia, Charlottesville, VA USA
[4] Zhejiang Univ, Hangzhou, Zhejiang, Peoples R China
Keywords
Language Models; Interpretability; Conceptual Learning
DOI
10.1007/978-981-97-2259-4_5
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Pretrained language models (PLMs) have made significant strides in various natural language processing tasks. However, the lack of interpretability due to their "black-box" nature poses challenges for responsible deployment. Although previous studies have attempted to improve interpretability by using, for example, attention weights in self-attention layers, these weights often lack clarity, readability, and intuitiveness. In this research, we propose a novel approach to interpreting PLMs by employing high-level, meaningful concepts that are easily understandable for humans. For example, we learn the concept of "Food" and investigate how it influences the model's sentiment prediction for a restaurant review. We introduce C3M, which combines human-annotated and machine-generated concepts to extract hidden neurons designed to encapsulate semantically meaningful and task-specific concepts. Through empirical evaluations on real-world datasets, we show that our approach offers valuable insights for interpreting PLM behavior, helps diagnose model failures, and enhances model robustness amid noisy concept labels.
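To make the concept-bottleneck idea in the abstract concrete, below is a minimal illustrative sketch, not the paper's C3M implementation: a pretrained encoder's [CLS] representation is mapped to a small set of human-readable concept scores (e.g., "Food"), and the task label is predicted only from those scores, so each concept's contribution to the prediction can be read off the final linear layer. The class name ConceptBottleneckClassifier, the choice of bert-base-uncased, and hyperparameters such as num_concepts are assumptions made for illustration.

```python
# Illustrative sketch of a concept-bottleneck head on top of a PLM (not the authors' C3M code).
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ConceptBottleneckClassifier(nn.Module):
    def __init__(self, plm_name="bert-base-uncased", num_concepts=8, num_labels=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(plm_name)        # pretrained language model
        hidden = self.encoder.config.hidden_size
        self.concept_head = nn.Linear(hidden, num_concepts)       # predicts human-readable concepts
        self.label_head = nn.Linear(num_concepts, num_labels)     # task label uses ONLY concept scores

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.last_hidden_state[:, 0]                       # [CLS] token representation
        concept_logits = self.concept_head(pooled)                 # supervised with concept labels (e.g., "Food")
        label_logits = self.label_head(torch.sigmoid(concept_logits))
        return concept_logits, label_logits

# Usage: inspect how concepts contribute to a sentiment prediction on a restaurant review.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = ConceptBottleneckClassifier()
batch = tokenizer(["The pasta was excellent but the service was slow."],
                  return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    concept_logits, label_logits = model(batch["input_ids"], batch["attention_mask"])
# label_head.weight[c, k] indicates how strongly concept k pushes the prediction toward class c.
```

Because the classifier sees only the concept activations, the weights of label_head give a direct, human-interpretable account of each prediction, which is the property the abstract highlights.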
Pages: 56-74
Page count: 19
Related Papers (50 in total)
  • [1] Peng, Baolin; Zhu, Chenguang; Zeng, Michael; Gao, Jianfeng. Data Augmentation for Spoken Language Understanding via Pretrained Language Models. INTERSPEECH 2021, 2021: 1219-1223.
  • [2] Sun, Kaili; Luo, Xudong; Luo, Michael Y. A Survey of Pretrained Language Models. KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, PT II, 2022, 13369: 442-456.
  • [3] Schick, Timo; Schuetze, Hinrich. Generating Datasets with Pretrained Language Models. 2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021: 6943-6951.
  • [4] Hofmann, Valentin; Glavas, Goran; Ljubesic, Nikola; Pierrehumbert, Janet B.; Schuetze, Hinrich. Geographic Adaptation of Pretrained Language Models. TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2024, 12: 411-431.
  • [5] Koto, Fajri; Lau, Jey Han; Baldwin, Timothy. Discourse Probing of Pretrained Language Models. 2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL-HLT 2021), 2021: 3849-3864.
  • [6] Tamkin, Alex; Singh, Trisha; Giovanardi, Davide; Goodman, Noah. Investigating Transferability in Pretrained Language Models. FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, EMNLP 2020, 2020: 1393-1401.
  • [7] Hassid, Michael; Remez, Tal; Nguyen, Tu Anh; Gat, Itai; Conneau, Alexis; Kreuk, Felix; Copet, Jade; Defossez, Alexandre; Synnaeve, Gabriel; Dupoux, Emmanuel; Schwartz, Roy; Adi, Yossi. Textually Pretrained Speech Language Models. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023.
  • [8] Niu, Tong; Yavuz, Semih; Zhou, Yingbo; Keskar, Nitish Shirish; Wang, Huan; Xiong, Caiming. Unsupervised Paraphrasing with Pretrained Language Models. 2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021: 5136-5150.
  • [9] Elazar, Yanai; Kassner, Nora; Ravfogel, Shauli; Ravichander, Abhilasha; Hovy, Eduard; Schutze, Hinrich; Goldberg, Yoav. Measuring and Improving Consistency in Pretrained Language Models. TRANSACTIONS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, 2021, 9: 1012-1031.
  • [10] Yu, Haibin; Zhao, Jing; Yang, Song; Wu, Zhongqin; Nie, Yuting; Zhang, Wei-Qiang. Language Recognition Based on Unsupervised Pretrained Models. INTERSPEECH 2021, 2021: 3271-3275.