Towards Offensive Language Identification for Tamil Code-Mixed YouTube Comments and Posts

Cited by: 0

Authors:

Charangan Vasantharajan

Uthayasanker Thayasivam

Affiliations:

[1] University of Moratuwa, Department of Computer Science and Engineering

Source:

SN Computer Science | 2022 / Vol. 3 / Issue 1

Keywords:

Offensive language; Code-mixed; Transformers; Tamil

DOI:

10.1007/s42979-021-00977-y

Abstract:

Offensive language detection on social media platforms has been an active field of research in recent years. In countries where English is not the native language, social media users mostly write their posts and comments in code-mixed text. This poses several challenges for offensive content identification, and the scarcity of resources for the Tamil language makes the task even harder. The current study presents extensive experiments using multiple deep learning and transfer learning models to detect offensive content on YouTube. We propose a novel and flexible approach of selective translation and transliteration to obtain better results from fine-tuning and ensembling multilingual transformer networks such as BERT, DistilBERT, and XLM-RoBERTa. The experimental results showed that ULMFiT was the best model for this task, with mBERT-BiLSTM also among the best performers. Both outperformed more popular transfer learning models such as DistilBERT and XLM-RoBERTa, as well as hybrid deep learning models, on this Tamil code-mixed dataset. The proposed ULMFiT and mBERT-BiLSTM models yielded good results and are promising for effective offensive speech identification in low-resourced languages.

