Reward modeling for mitigating toxicity in transformer-based language models

Cited by: 0
Authors
Farshid Faal
Ketra Schmitt
Jia Yuan Yu
Institutions
[1] Concordia University, Concordia Institute for Information System Engineering
[2] Concordia University, Centre for Engineering in Society
Source
Applied Intelligence | 2023, Vol. 53
Keywords
Language models; Transformers; Reinforcement learning; Toxic language mitigation; Natural language generation;
DOI
Not available
Abstract
Transformer-based language models can generate fluent text and be efficiently adapted across various natural language generation tasks. However, language models pretrained on large unlabeled web text corpora have been shown to degenerate into toxic content and to reproduce social biases, hindering their safe deployment. Various detoxification methods have been proposed to mitigate language model toxicity; however, these methods struggle to detoxify language models when conditioned on prompts that mention specific social identities related to gender, race, or religion. In this study, we propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models. We address the challenge of safety in language models and propose a new reward model that can detect toxic content and mitigate unintended bias toward social identities in toxicity prediction. The experiments demonstrate that Reinforce-Detoxify outperforms existing detoxification approaches on automatic evaluation metrics, indicating that our approach to language model detoxification is less prone to unintended bias toward social identities in generated content.
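The abstract describes fine-tuning a language model with reinforcement learning against a learned reward model that scores toxicity. As a rough illustration only (not the authors' implementation), the sketch below runs a REINFORCE-style policy-gradient update on a toy three-token policy; the `REWARD` table is a hypothetical stand-in for a learned toxicity-aware reward model.

```python
import numpy as np

# Toy setup: the "policy" is a softmax over a three-token vocabulary, and
# REWARD is a stand-in for a learned reward model (1.0 = non-toxic, 0.0 = toxic).
VOCAB = ["kind", "neutral", "rude"]
REWARD = np.array([1.0, 0.8, 0.0])

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def reinforce_step(logits, rng, lr=0.5):
    """One policy-gradient step: sample a token, score it with the reward
    model, and push the logits toward high-reward tokens."""
    probs = softmax(logits)
    a = rng.choice(len(VOCAB), p=probs)
    r = REWARD[a]
    # Gradient of log pi(a) with respect to the logits is one_hot(a) - probs.
    grad = -probs
    grad[a] += 1.0
    return logits + lr * r * grad

rng = np.random.default_rng(0)
logits = np.zeros(3)
for _ in range(200):
    logits = reinforce_step(logits, rng)
probs = softmax(logits)
print(probs)  # probability mass shifts away from the zero-reward "rude" token
```

In the paper's actual setting the policy is the full language model, the reward model is a trained toxicity classifier designed to avoid identity-based bias, and the optimization typically includes a constraint keeping the fine-tuned model close to the pretrained one; none of that machinery is shown here.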
Pages: 8421-8435
Page count: 14
Related papers
50 records in total
  • [31] Task-Specific Transformer-Based Language Models in HealthCare: Scoping Review
    Cho, Ha Na
    Jun, Tae Joon
    Kim, Young-Hak
    Kang, Heejun
    Ahn, Imjin
    Gwon, Hansle
    Kim, Yunha
    Seo, Jiahn
    Choi, Heejung
    Kim, Minkyoung
    Han, Jiye
    Kee, Gaeun
    Park, Seohyun
    Ko, Soyoung
    JMIR MEDICAL INFORMATICS, 2024, 12
  • [32] A Comparative Analysis of Transformer-based Protein Language Models for Remote Homology Prediction
    Kabir, Anowarul
    Moldwin, Asher
    Shehu, Amarda
    14TH ACM CONFERENCE ON BIOINFORMATICS, COMPUTATIONAL BIOLOGY, AND HEALTH INFORMATICS, BCB 2023, 2023,
  • [33] Transformer-based Language Models and Homomorphic Encryption: An Intersection with BERT-tiny
    Rovida, Lorenzo
    Leporati, Alberto
    PROCEEDINGS OF THE 10TH ACM INTERNATIONAL WORKSHOP ON SECURITY AND PRIVACY ANALYTICS, IWSPA 2024, 2024, : 3 - 13
  • [34] Boost Transformer-based Language Models with GPU-Friendly Sparsity and Quantization
    Yu, Chong
    Chen, Tao
    Gan, Zhongxue
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, 2023, : 218 - 235
  • [35] Empirical Study of Tweets Topic Classification Using Transformer-Based Language Models
    Mandal, Ranju
    Chen, Jinyan
    Becken, Susanne
    Stantic, Bela
    INTELLIGENT INFORMATION AND DATABASE SYSTEMS, ACIIDS 2021, 2021, 12672 : 340 - 350
  • [36] An Architecture for Accelerated Large-Scale Inference of Transformer-Based Language Models
    Ganiev, Amir
    Chapin, Colt
    de Andrade, Anderson
    Liu, Chen
    2021 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, NAACL-HLT 2021, 2021, : 163 - 169
  • [37] Influence of Language Proficiency on the Readability of Review Text and Transformer-based Models for Determining Language Proficiency
    Sazzed, Salim
    COMPANION PROCEEDINGS OF THE WEB CONFERENCE 2022, WWW 2022 COMPANION, 2022, : 881 - 886
  • [38] Cyberbullying Text Identification: A Deep Learning and Transformer-based Language Modeling Approach
    Saifullah K.
    Khan M.I.
    Jamal S.
    Sarker I.H.
    EAI Endorsed Transactions on Industrial Networks and Intelligent Systems, 2024, 11 (01) : 1 - 12
  • [39] Bringing order into the realm of Transformer-based language models for artificial intelligence and law
    Greco, Candida M.
    Tagarelli, Andrea
    ARTIFICIAL INTELLIGENCE AND LAW, 2024, 32 (04) : 863 - 1010
  • [40] Stress Test Evaluation of Transformer-based Models in Natural Language Understanding Tasks
    Aspillaga, Carlos
    Carvallo, Andres
    Araujo, Vladimir
    PROCEEDINGS OF THE 12TH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2020), 2020, : 1882 - 1894