Filtering offensive language from multilingual social media contents: A deep learning approach

Authors
Saumya, Sunil [1]
Kumar, Abhinav [2]
Singh, Jyoti Prakash [3]
Affiliations
[1] Indian Inst Informat Technol Dharwad, Dharwad, Karnataka, India
[2] Motilal Nehru Natl Inst Technol Allahabad, Prayagraj, Uttar Pradesh, India
[3] Natl Inst Technol Patna, Patna, Bihar, India
Keywords
Social media; Offensive language; Machine learning; Offensive content; Deep learning; Multilingual and bilingual; Hate speech; Code-mixed
DOI
10.1016/j.engappai.2024.108159
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Uncontrolled offensive content on social media makes automated detection a critical need. This paper addresses that need by proposing an approach for identifying offensive language in multilingual, code-mixed, and script-mixed settings. The study presents a novel multilingual hybrid dataset constructed by merging diverse monolingual and bilingual resources. It then systematically evaluates the impact on detection accuracy of input representations (Word2Vec, Global Vectors for Word Representation (GloVe), Bidirectional Encoder Representations from Transformers (BERT), and uniform initialization) and of deep learning models (Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (Bi-LSTM), Bi-LSTM-Attention, and fine-tuned BERT). Experiments on a dataset of 42,560 social media comments in five languages (English, Hindi, German, Tamil, and Malayalam) show that fine-tuned BERT performs best: it achieves a macro-average F1-score of 0.79 on monolingual tasks and 0.86 on code-mixed and script-mixed tasks. These findings advance offensive language detection methods and shed light on the complex dynamics of multilingual social media, paving the way for more inclusive and safer online communities.
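The abstract reports results as macro-average F1-scores. As a reminder of what that metric measures, the following is a minimal sketch of the standard definition (the unweighted mean of per-class F1), not the authors' evaluation code. The function name `macro_f1`, the label strings `"OFF"` and `"NOT"`, and the toy predictions are all assumptions made for illustration; they do not come from the paper or its dataset.

```python
def macro_f1(y_true, y_pred):
    """Return the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    pairs = list(zip(y_true, y_pred))
    scores = []
    for label in labels:
        # Count true positives, false positives, false negatives for this class.
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy labels, not taken from the paper's data.
y_true = ["OFF", "NOT", "NOT", "OFF", "NOT", "NOT"]
y_pred = ["OFF", "NOT", "OFF", "NOT", "NOT", "NOT"]
score = macro_f1(y_true, y_pred)
print(f"macro F1 = {score:.3f}")
```

Because every class contributes equally regardless of its frequency, macro averaging penalizes a model that does well only on the majority class, which matters when offensive comments are the minority.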
Pages: 15