Automatic Detection of Offensive Language for Urdu and Roman Urdu

Cited by: 51
Authors
Akhter, Muhammad Pervez [1 ]
Zheng Jiangbin [1 ]
Naqvi, Irfan Raza [1 ]
Abdelmajeed, Mohammed [2 ]
Sadiq, Muhammad Tariq [3 ]
Affiliations
[1] Northwestern Polytech Univ, Sch Software & Microelect, Xian 710072, Peoples R China
[2] Northwestern Polytech Univ, Sch Comp Sci & Technol, Xian 710072, Peoples R China
[3] Northwestern Polytech Univ, Sch Automat, Xian 710072, Peoples R China
Source
IEEE ACCESS | 2020 / Volume 8 / Issue 08
Funding
National Natural Science Foundation of China;
Keywords
Machine learning; YouTube; Feature extraction; Videos; Writing; Twitter; Social media; offensive language detection; natural language processing; machine learning; text processing; ONLINE COMMUNICATION; HATE SPEECH; CLASSIFICATION; TWITTER;
DOI
10.1109/ACCESS.2020.2994950
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In recent years, unethical behavior in the cyber-environment has come to light. The presence of offensive language on social media platforms, and the automatic detection of such language, is becoming a major challenge in modern society. The complexity of natural language constructs makes this task even more challenging. Until now, most research has focused on resource-rich languages such as English. Urdu is written on social media in two scripts: Roman Urdu and Urdu. The Roman script uses English-language characters, while the Urdu script uses Urdu-language characters. Urdu and Hindi are similar languages whose only difference is their writing script, but the Roman scripts of both languages are similar. This study addresses the detection of offensive language in user comments written in Urdu, a resource-poor language. We propose the first offensive-language dataset of Urdu, containing user-generated comments from social media. We use individual and combined n-gram techniques to extract features at the character level and the word level. We apply seventeen classifiers from seven machine learning techniques to detect offensive language in both Urdu and Roman Urdu text comments. Experiments show that regression-based models using character n-grams perform best on the Urdu language. Character-level tri-grams outperform the other word and character n-grams. LogitBoost and SimpleLogistic outperform the other models, achieving F-measure values of 99.2 and 95.9 on the Roman Urdu and Urdu datasets, respectively. Our dataset is publicly available on GitHub for future research.
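The individual and combined n-gram features described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`char_ngrams`, `word_ngrams`, `combined_features`) and the sample comment are hypothetical, and the resulting counts are only the kind of input a classifier might consume.

```python
from collections import Counter


def char_ngrams(text, n):
    """Return the overlapping character n-grams of text (spaces are kept)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(text, n):
    """Return the overlapping word n-grams of text, split on whitespace."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def combined_features(text, sizes, level="char"):
    """Count n-grams for every size in sizes (a 'combined' n-gram feature set)."""
    extract = char_ngrams if level == "char" else word_ngrams
    counts = Counter()
    for n in sizes:
        counts.update(extract(text, n))
    return counts


# Hypothetical Roman Urdu comment, not taken from the paper's dataset.
comment = "yeh acha hai"
trigrams = char_ngrams(comment, 3)                      # individual character tri-grams
uni_bi = combined_features(comment, [1, 2], "word")     # combined word uni-grams and bi-grams
print(trigrams[:4])
print(uni_bi)
```

Because Python strings are sequences of Unicode code points, the same functions apply to Urdu-script text as well as Roman Urdu, although the sketch does no normalization of diacritics or spelling variants.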
Pages: 91213-91226
Page count: 14
Related papers
50 results in total
  • [1] Hate-Speech and Offensive Language Detection in Roman Urdu
    Rizwan, Hammad
    Shakeel, Muhammad Haroon
    Karim, Asim
    [J]. PROCEEDINGS OF THE 2020 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP), 2020, : 2512 - 2522
  • [2] Roman-Urdu-Parl: Roman-Urdu and Urdu Parallel Corpus for Urdu Language Understanding
    Alam, Mehreen
    Ul Hussain, Sibt
    [J]. ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2022, 21 (01)
  • [3] Automatic Abusive Language Detection in Urdu Tweets
    Amjad, Maaz
    Ashraf, Noman
    Sidorov, Grigori
    Zhila, Alisa
    Chanona-Hernandez, Liliana
    Gelbukh, Alexander
    [J]. ACTA POLYTECHNICA HUNGARICA, 2022, 19 (10) : 143 - 163
  • [4] A Review of Urdu Sentiment Analysis with Multilingual Perspective: A Case of Urdu and Roman Urdu Language
    Khan, Ihsan Ullah
    Khan, Aurangzeb
    Khan, Wahab
    Su'ud, Mazliham Mohd
    Alam, Muhammad Mansoor
    Subhan, Fazli
    Asghar, Muhammad Zubair
    [J]. COMPUTERS, 2022, 11 (01)
  • [5] Corpus for Emotion Detection on Roman Urdu
    Arshad, Muhammad Umair
    Bashir, Muhammad Farrukh
    Majeed, Adil
    Shahzad, Waseem
    Beg, Mirza Omer
    [J]. 2019 22ND IEEE INTERNATIONAL MULTI TOPIC CONFERENCE (INMIC), 2019, : 164 - 169
  • [6] Hate Speech Detection in Roman Urdu
    Khan, Muhammad Moin
    Shahzad, Khurram
    Malik, Muhammad Kamran
    [J]. ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2021, 20 (01)
  • [7] A survey of automatic Urdu language processing
    Anwar, Waqas
    Wang, Xuan
    Wang, Xiao-Long
    [J]. PROCEEDINGS OF 2006 INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND CYBERNETICS, VOLS 1-7, 2006, : 4489 - +
  • [8] Identification of offensive language in Urdu using semantic and embedding models
    Hussain, Sajid
    Malik, Muhammad Shahid Iqbal
    Masood, Nayyer
    [J]. PeerJ Computer Science, 2022, 8
  • [9] Identification of offensive language in Urdu using semantic and embedding models
    Hussain, Sajid
    Malik, Muhammad Shahid Iqbal
    Masood, Nayyer
    [J]. PEERJ COMPUTER SCIENCE, 2022, 8
  • [10] Multilingual Detection of Cyberbullying in Mixed Urdu, Roman Urdu, and English Social Media Conversations
    Razi, Fakhra
    Ejaz, Naveed
    [J]. IEEE ACCESS, 2024, 12 : 105201 - 105210