Reward modeling for mitigating toxicity in transformer-based language models

Cited by: 0
Authors
Farshid Faal
Ketra Schmitt
Jia Yuan Yu
Affiliations
[1] Concordia Institute for Information Systems Engineering, Concordia University
[2] Centre for Engineering in Society, Concordia University
Source
Applied Intelligence | 2023, Vol. 53
Keywords
Language models; Transformers; Reinforcement learning; Toxic language mitigation; Natural language generation
DOI
Not available
Abstract
Transformer-based language models can generate fluent text and are efficiently adapted to a wide range of natural language generation tasks. However, language models pretrained on large unlabeled web-text corpora have been shown to generate toxic content and exhibit social bias, which hinders their safe deployment. Various detoxification methods have been proposed to mitigate language model toxicity, but they struggle to detoxify models conditioned on prompts that mention specific social identities related to gender, race, or religion. In this study, we propose Reinforce-Detoxify, a reinforcement learning-based method for mitigating toxicity in language models. We address the challenge of safety in language models and propose a new reward model that detects toxic content while mitigating unintended bias toward social identities in toxicity prediction. Experiments demonstrate that Reinforce-Detoxify outperforms existing detoxification approaches on automatic evaluation metrics, and that its generated content is less prone to unintended bias toward social identities.
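To make the training setup concrete, the following is a minimal, illustrative sketch of one REINFORCE-style detoxification step: sample a continuation from the policy language model, score it with a toxicity classifier used as a stand-in reward model, and apply a policy-gradient update. This is not the authors' implementation; the model names ("gpt2" as the policy, the public "unitary/toxic-bert" classifier as the reward model), the probe prompt, the reward definition, and the single-sample update are all assumptions for illustration.

```python
# Minimal REINFORCE-style detoxification step (illustrative sketch only).
# Assumptions: "gpt2" stands in for the policy LM; "unitary/toxic-bert"
# stands in for the paper's learned, bias-aware reward model.
# Requires: pip install torch transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline

tok = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-5)

# Stand-in reward model: maps text to per-label toxicity scores in [0, 1].
toxicity = pipeline("text-classification", model="unitary/toxic-bert")

prompt = "People who follow that religion are"  # hypothetical probe prompt
inputs = tok(prompt, return_tensors="pt")
prompt_len = inputs["input_ids"].shape[1]

# 1) Sample a continuation from the current policy (no gradients here).
gen = policy.generate(**inputs, do_sample=True, max_new_tokens=30,
                      pad_token_id=tok.eos_token_id)
text = tok.decode(gen[0, prompt_len:], skip_special_tokens=True)

# 2) Score it: reward is high when the classifier sees low toxicity.
#    The label name "toxic" is assumed from the classifier's label set.
scores = toxicity(text, top_k=None)  # scores for all labels
p_toxic = next(s["score"] for s in scores if s["label"] == "toxic")
reward = 1.0 - p_toxic

# 3) REINFORCE update: generate() runs without gradients, so re-run a
#    forward pass to get differentiable log-probs of the sampled tokens.
logits = policy(gen).logits[:, :-1, :]  # predictions for tokens 1..T
logprobs = torch.log_softmax(logits, dim=-1)
token_logprobs = logprobs.gather(2, gen[:, 1:].unsqueeze(-1)).squeeze(-1)
gen_logprobs = token_logprobs[:, prompt_len - 1:]  # credit only the continuation
loss = -reward * gen_logprobs.sum()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

A faithful reproduction would batch prompts, subtract a baseline to reduce gradient variance, and constrain the fine-tuned policy (for example with a KL penalty toward the pretrained model, a common stabilizer in RL fine-tuning) so fluency is preserved while toxicity drops. The paper's method additionally trains its own reward model designed to avoid penalizing mere mentions of social identities, which this off-the-shelf classifier does not address.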
Pages: 8421-8435