Roman Urdu News Headline Classification Empowered with Machine Learning

Cited by: 9

Authors:

Naqvi, Rizwan Ali [1]

Khan, Muhammad Adnan [2]

Malik, Nauman [2]

Saqib, Shazia [2]

Alyas, Tahir [2]

Hussain, Dildar [3]

Affiliations:

[1] Sejong Univ, Dept Unmanned Vehicle Engn, Seoul 05006, South Korea

[2] Lahore Garrison Univ, Dept Comp Sci, Lahore 54000, Pakistan

[3] Korea Inst Adv Study, Sch Computat Sci, Seoul 02455, South Korea

Source:

CMC-COMPUTERS MATERIALS & CONTINUA | 2020 / Vol. 65 / No. 02

Keywords:

Roman Urdu; news headline classification; long short term memory; recurrent neural network; logistic regression; multinomial naive Bayes; random forest; k neighbor; gradient boosting classifier; sentiment analysis

DOI:

10.32604/cmc.2020.011686

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Roman Urdu has been used for text messaging over the Internet for years, especially in the Indo-Pak subcontinent. People from the subcontinent may speak the same Urdu language, yet they may write it in different scripts. Communication in Roman characters, used to write Urdu on social media, is now considered the most common form of communication in the Indian subcontinent, which makes it a valuable source of information. English text classification is a solved problem, but only a few efforts have examined the rich information available in Roman Urdu. This is due to the numerous complexities involved in processing Roman Urdu data, including the non-availability of a tagged corpus, the lack of a set of rules, and the lack of standardized spellings. A large amount of Roman Urdu news data is available on mainstream news websites and on social media websites such as Facebook and Twitter, but meaningful information can only be extracted if the data is in a structured format. We have developed a Roman Urdu news headline classifier, which will help to classify news into relevant categories on which further analysis and modeling can be done. The authors aim to develop a Roman Urdu news classifier that classifies news into five categories (health, business, technology, sports, international). First, we will build the news dataset using scraping tools; then, after preprocessing, we will compare the results of different machine learning algorithms such as Logistic Regression (LR), Multinomial Naive Bayes (MNB), Long Short-Term Memory (LSTM), and Convolutional Neural Network (CNN). After this, we will use a phonetic algorithm to control lexical variation and test on news from different websites. The preliminary results suggest that more accurate classification can be accomplished by monitoring noise inside the data and by classifying the news. After applying the above-mentioned machine learning algorithms, the results show that the Multinomial Naive Bayes classifier gives the best accuracy, 90.17%, which is attributed to noise from lexical variation.
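
The abstract mentions using a phonetic algorithm to control lexical variation but does not name it. As a rough illustration only, the sketch below shows how spelling variants of the same Roman Urdu word could be mapped to a shared key before classification. The rules and the function names phonetic_key and normalize are hypothetical and are not the authors' method.

```python
import re


def phonetic_key(word: str) -> str:
    """Map a word to a rough phonetic key (illustrative rules, not the paper's)."""
    w = re.sub(r"[^a-z]", "", word.lower())   # keep ASCII letters only
    w = re.sub(r"(.)\1+", r"\1", w)           # collapse runs of a repeated letter
    if not w:
        return ""
    # Keep the first letter, drop vowels elsewhere, so variants that differ
    # only in vowel choice or letter doubling share one key.
    return w[0] + re.sub(r"[aeiouy]", "", w[1:])


def normalize(headline: str) -> str:
    """Replace every token in a headline with its phonetic key."""
    keys = (phonetic_key(token) for token in headline.split())
    return " ".join(k for k in keys if k)


if __name__ == "__main__":
    # Two hypothetical spellings of the same headline normalize identically.
    print(normalize("Sehat ki khabar"))
    print(normalize("Sehhat ki khabbar"))
```

Keys this coarse also merge unrelated words that share a consonant skeleton, so in practice the rules would need tuning against the corpus before the normalized text is passed to a classifier.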

Pages: 1221-1236

Page count: 16
