Reward modeling for mitigating toxicity in transformer-based language models

Cited by: 0
Authors
Farshid Faal
Ketra Schmitt
Jia Yuan Yu
Institutions
[1] Concordia University, Concordia Institute for Information System Engineering
[2] Concordia University, Centre for Engineering in Society
Source
Applied Intelligence | 2023, Vol. 53
Keywords
Language models; Transformers; Reinforcement learning; Toxic language mitigation; Natural language generation;
DOI
Not available
Abstract
Transformer-based language models can generate fluent text and be efficiently adapted across various natural language generation tasks. However, language models pretrained on large unlabeled web text corpora have been shown to degenerate into toxic content and to reproduce social biases, hindering their safe deployment. Various detoxification methods have been proposed to mitigate language model toxicity; however, these methods struggle to detoxify language models when conditioned on prompts that mention specific social identities related to gender, race, or religion. In this study, we propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models. We address the challenge of safety in language models and propose a new reward model that can detect toxic content and mitigate unintended bias toward social identities in toxicity prediction. The experiments demonstrate that Reinforce-Detoxify outperforms existing detoxification approaches on automatic evaluation metrics, indicating that our approach to language model detoxification is less prone to unintended bias toward social identities in generated content.
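The abstract describes fine-tuning a language model with reinforcement learning against a learned reward model that scores toxicity. As a rough illustration only (not the authors' implementation), the sketch below runs a REINFORCE-style policy-gradient update on a toy three-token policy; the `REWARD` table is a hypothetical stand-in for a learned toxicity-aware reward model.

```python
import numpy as np

# Toy setup: the "policy" is a softmax over a three-token vocabulary, and
# REWARD is a stand-in for a learned reward model (1.0 = non-toxic, 0.0 = toxic).
VOCAB = ["kind", "neutral", "rude"]
REWARD = np.array([1.0, 0.8, 0.0])

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_step(logits, rng, lr=0.5):
    """One policy-gradient step: sample a token, score it with the reward
    model, and push the logits toward high-reward tokens."""
    probs = softmax(logits)
    a = rng.choice(len(VOCAB), p=probs)
    r = REWARD[a]
    # Gradient of log pi(a) with respect to the logits is one_hot(a) - probs.
    grad = -probs
    grad[a] += 1.0
    return logits + lr * r * grad

rng = np.random.default_rng(0)
logits = np.zeros(3)
for _ in range(200):
    logits = reinforce_step(logits, rng)
probs = softmax(logits)
print(probs)  # probability mass shifts away from the zero-reward "rude" token
```

In the paper's actual setting the policy is the full language model, the reward model is a trained toxicity classifier designed to avoid identity-based bias, and the optimization typically includes a constraint keeping the fine-tuned model close to the pretrained one; none of that machinery is shown here.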
Pages: 8421-8435
Page count: 14
Related papers
50 records in total
  • [31] Task-Specific Transformer-Based Language Models in HealthCare: Scoping Review
    Cho, Ha Na
    Jun, Tae Joon
    Kim, Young-Hak
    Kang, Heejun
    Ahn, Imjin
    Gwon, Hansle
    Kim, Yunha
    Seo, Jiahn
    Choi, Heejung
    Kim, Minkyoung
    Han, Jiye
    Kee, Gaeun
    Park, Seohyun
    Ko, Soyoung
    JMIR MEDICAL INFORMATICS, 2024, 12
  • [32] A Comparative Analysis of Transformer-based Protein Language Models for Remote Homology Prediction
    Kabir, Anowarul
    Moldwin, Asher
    Shehu, Amarda
    14TH ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY, AND HEALTH INFORMATICS, BCB 2023, 2023,
  • [33] Transformer-based Language Models and Homomorphic Encryption: An Intersection with BERT-tiny
    Rovida, Lorenzo
    Leporati, Alberto
    PROCEEDINGS OF THE 10TH ACM INTERNATIONAL WORKSHOP ON SECURITY AND PRIVACY ANALYTICS, IWSPA 2024, 2024, : 3 - 13
  • [34] Boost Transformer-based Language Models with GPU-Friendly Sparsity and Quantization
    Yu, Chong
    Chen, Tao
    Gan, Zhongxue
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 218 - 235
  • [35] Empirical Study of Tweets Topic Classification Using Transformer-Based Language Models
    Mandal, Ranju
    Chen, Jinyan
    Becken, Susanne
    Stantic, Bela
    INTELLIGENT INFORMATION AND DATABASE SYSTEMS, ACIIDS 2021, 2021, 12672 : 340 - 350
  • [36] An Architecture for Accelerated Large-Scale Inference of Transformer-Based Language Models
    Ganiev, Amir
    Chapin, Colt
    de Andrade, Anderson
    Liu, Chen
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, NAACL-HLT 2021, 2021, : 163 - 169
  • [37] Influence of Language Proficiency on the Readability of Review Text and Transformer-based Models for Determining Language Proficiency
    Sazzed, Salim
    COMPANION PROCEEDINGS OF THE WEB CONFERENCE 2022, WWW 2022 COMPANION, 2022, : 881 - 886
  • [38] Cyberbullying Text Identification: A Deep Learning and Transformer-based Language Modeling Approach
    Saifullah K.
    Khan M.I.
    Jamal S.
    Sarker I.H.
    EAI Endorsed Transactions on Industrial Networks and Intelligent Systems, 2024, 11 (01) : 1 - 12
  • [39] Bringing order into the realm of Transformer-based language models for artificial intelligence and law
    Greco, Candida M.
    Tagarelli, Andrea
    ARTIFICIAL INTELLIGENCE AND LAW, 2024, 32 (04) : 863 - 1010
  • [40] Stress Test Evaluation of Transformer-based Models in Natural Language Understanding Tasks
    Aspillaga, Carlos
    Carvallo, Andres
    Araujo, Vladimir
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 1882 - 1894