Roman Urdu toxic comment classification

Cited by: 10
Authors
Saeed, Hafiz Hassaan [1]
Ashraf, Muhammad Haseeb [1]
Kamiran, Faisal [1]
Karim, Asim [2]
Calders, Toon [3]
Affiliations
[1] Informat Technol Univ, Lahore, Pakistan
[2] Lahore Univ Management Sci, Lahore, Pakistan
[3] Univ Antwerp, Antwerp, Belgium
Keywords
Roman Urdu; Toxic comment classification; Deep learning; Roman Urdu toxic comments; Deep ensemble; Twitter
DOI
10.1007/s10579-021-09530-y
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
With the increasing popularity of user-generated content on social media, the number of toxic texts is also on the rise. Such texts have adverse effects on users and on society at large; the identification of toxic comments is therefore a growing need. While toxic comment classification has been studied for resource-rich languages like English, no work has been done for Roman Urdu, even though it is widely used on social media in South Asia. This paper addresses the challenge of Roman Urdu toxic comment detection by developing the first large labeled corpus of toxic and non-toxic comments. The corpus, called RUT (Roman Urdu Toxic), contains over 72 thousand comments collected from popular social media platforms and has been labeled manually with strong inter-annotator agreement. With this dataset, we train several classification models to detect Roman Urdu toxic comments, including classical machine learning models with the bag-of-words representation and some recent deep models based on word embeddings. Despite the success of the latter in classifying toxic comments in English, the absence of pre-trained word embeddings for Roman Urdu prompted us to generate different word embeddings using the GloVe, Word2Vec and FastText techniques, and to compare them with task-specific word embeddings learned inside the classification task. Finally, we propose an ensemble approach that reaches our best F1-score of 86.35%, setting the first benchmark for toxic comment classification in Roman Urdu.
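As a rough illustration of the two ideas the abstract closes on, combining several classifiers into an ensemble and scoring the result with F1, the following Python sketch uses simple majority voting over toy predictions. The abstract does not say how the authors' ensemble combines its members, so majority voting, the function names, and the toy labels below are all assumptions made for illustration; they are not the paper's method or data.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one label per comment.

    `predictions` is a list of equal-length label lists, one per model.
    Assumption: an odd number of binary voters, so no ties occur.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def f1_score(y_true, y_pred, positive=1):
    """F1 for the `positive` class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): 1 = toxic, 0 = non-toxic.
y_true = [1, 0, 1, 1, 0, 0]
model_a = [1, 0, 0, 1, 0, 1]  # e.g. a bag-of-words classifier
model_b = [1, 1, 1, 1, 0, 0]  # e.g. a model on pre-trained embeddings
model_c = [0, 0, 1, 1, 0, 0]  # e.g. a model on task-specific embeddings

ensemble = majority_vote([model_a, model_b, model_c])

for name, pred in [("model_a", model_a), ("model_b", model_b),
                   ("model_c", model_c), ("ensemble", ensemble)]:
    print(f"{name}: F1 = {f1_score(y_true, pred):.4f}")
```

In this toy setup each model makes different mistakes, so the vote corrects them; real ensembles gain less when their members' errors are correlated.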
Pages: 971-996
Number of pages: 26
Published in: Language Resources and Evaluation, 2021, 55: 971-996