Reward modeling for mitigating toxicity in transformer-based language models

Authors
Farshid Faal
Ketra Schmitt
Jia Yuan Yu
Affiliations
[1] Concordia University, Concordia Institute for Information Systems Engineering
[2] Concordia University, Centre for Engineering in Society
Source
Applied Intelligence | 2023 / Vol. 53
Keywords
Language models; Transformers; Reinforcement learning; Toxic language mitigation; Natural language generation
Abstract
Transformer-based language models can generate fluent text and be efficiently adapted to a variety of natural language generation tasks. However, language models pretrained on large unlabeled web text corpora have been shown to degenerate into toxic content and to exhibit social bias, which hinders their safe deployment. Various detoxification methods have been proposed to mitigate language model toxicity; however, these methods struggle when the model is conditioned on prompts that contain specific social identities related to gender, race, or religion. In this study, we propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models. We address the challenge of safety in language models and propose a new reward model that can detect toxic content and mitigate unintended bias towards social identities in toxicity prediction. Experiments demonstrate that Reinforce-Detoxify outperforms existing detoxification approaches on automatic evaluation metrics, indicating that our approach is less prone to unintended bias toward social identities in generated content.
Pages: 8421-8435
Page count: 14