Development of an Efficient Method to Detect Mixed Social Media Data with Tamil-English Code Using Machine Learning Techniques

Cited by: 3

Authors:

Fha, Shibly [1,2]

Sharma, Uzzal [1]

Naleer, Hmm [3]

Affiliations:

[1] Assam Don Bosco Univ, Gauhati, India

[2] South Eastern Univ Sri Lanka, Oluvil, Sri Lanka

[3] South Eastern Univ Sri Lanka, Fac Appl Sci, Dept Comp Sci, Oluvil, Sri Lanka

Source:

ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING | 2023 / Vol. 22 / Issue 02

Keywords:

Tamil; English; code mixed; hate speech; machine learning and ensemble classification;

DOI:

10.1145/3563775

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Online hate speech has become more prevalent on social networking sites with the rapid expansion of mobile computing and Web technology. Previous research has found that exposure to online hate speech has substantial offline consequences for historically disadvantaged communities, so there is considerable interest in the automated detection of hateful comments and posts. Hate speech can affect any population group, but some groups are more vulnerable than others. Against this background, detecting and reporting hate-related comments and posts can help to avoid the harmful effects of hate speech. Existing studies in this context have found machine learning algorithms to be more efficient at detecting abusive text on social media. In this research, we applied seven machine learning algorithms, namely Support Vector Machine (SVM), Naive Bayes (NB), Logistic Regression (LR), Decision Tree (DT), Random Forest (RF), Gradient Boost (GB), and K Nearest Neighbor (KNN), to detect hate speech, and compared their performance in order to develop an ensemble model. The researchers collected and combined a Tamil-English code-mixed hate speech tweet dataset created in HASOC. The dataset contains 35,442 tweets, divided into two groups: not offensive and offensive. NB obtained the highest F1 scores for both offensive and not offensive tweets, along with the highest weighted average, whereas SVM obtained the highest accuracy in detecting Tamil-English hate speech text, at 80% in 10-fold cross-validation. Based on the stand-alone performances, the researchers developed two ensemble classifiers: a max-voting ensemble and an averaging ensemble. The averaging ensemble obtained an accuracy of 90.67%.
These findings are significant because the results can serve as a reference model for Tamil-English code-mixed hate speech detection, against which future work using other algorithms can be evaluated in order to identify hateful content more accurately.
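
The abstract names two ensemble strategies, max-voting and averaging, without describing how they were implemented. The following is a minimal Python sketch of the generic versions of both techniques, not the authors' code. The label names, the classifier names, and the probability values are all hypothetical and chosen only to show that the two strategies can disagree on the same input.

```python
from collections import Counter
from typing import Dict, List

# Label order is an assumption for this sketch: index 0 = not offensive, 1 = offensive.
LABELS = ["not_offensive", "offensive"]


def argmax(values: List[float]) -> int:
    """Return the index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)


def max_voting(probas: Dict[str, List[float]]) -> str:
    """Hard voting: each classifier votes for its most probable label,
    and the label with the most votes wins."""
    votes = [LABELS[argmax(p)] for p in probas.values()]
    return Counter(votes).most_common(1)[0][0]


def averaging(probas: Dict[str, List[float]]) -> str:
    """Averaging: the class probabilities are averaged across classifiers,
    and the label with the highest mean probability wins."""
    n = len(probas)
    mean = [sum(p[i] for p in probas.values()) / n for i in range(len(LABELS))]
    return LABELS[argmax(mean)]


# Hypothetical class probabilities from three classifiers for a single tweet.
example = {
    "svm": [0.55, 0.45],
    "nb": [0.60, 0.40],
    "lr": [0.05, 0.95],
}

print(max_voting(example))  # two of three classifiers vote not_offensive
print(averaging(example))   # mean probabilities are [0.40, 0.60]
```

In this made-up example, two weakly confident classifiers outvote one strongly confident classifier under max-voting, while averaging lets the confident prediction dominate. This is one general reason the two ensembles can score differently; the sketch says nothing about why the averaging ensemble performed as reported in the paper.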


Pages: 19

