Towards Offensive Language Identification for Tamil Code-Mixed YouTube Comments and Posts

Authors
Charangan Vasantharajan
Uthayasanker Thayasivam
Affiliations
[1] University of Moratuwa, Department of Computer Science and Engineering
Keywords
Offensive language; Code-mixed; Transformers; Tamil
DOI
10.1007/s42979-021-00977-y
Abstract
Offensive language detection on social media platforms has been an active field of research in recent years. In non-native English-speaking countries, social media users mostly write their posts and comments in code-mixed text. This poses several challenges for offensive content identification, and the low resources available for the Tamil language make the task even more challenging. The current study presents extensive experiments using multiple deep learning and transfer learning models to detect offensive content on YouTube. We propose a novel and flexible approach of selective translation and transliteration to obtain better results from fine-tuning and ensembling multilingual transformer networks such as BERT, DistilBERT, and XLM-RoBERTa. The experimental results showed that ULMFiT is the best model for this task. On this Tamil code-mixed dataset, the best-performing models were ULMFiT and mBERT-BiLSTM, rather than more popular transfer learning models such as DistilBERT and XLM-RoBERTa or hybrid deep learning models. The proposed models, ULMFiT and mBERT-BiLSTM, yielded good results and are promising for effective offensive speech identification in low-resourced languages.