A clustering framework for lexical normalization of Roman Urdu

Cited by: 2

Authors:

Khan, Abdul Rafae [1,2]

Karim, Asim [3]

Sajjad, Hassan [4]

Kamiran, Faisal [5]

Xu, Jia [1,2]

Affiliations:

[1] Stevens Inst Technol, Hoboken, NJ 07030 USA

[2] CUNY, Comp Sci Dept, Grad Ctr, 365 5th Ave, New York, NY 10016 USA

[3] Lahore Univ Management Sci, Lahore 54792, Pakistan

[4] Hamad Bin Khalifa Univ, Qatar Comp Res Inst, Doha, Qatar

[5] Informat Technol Univ, Arfa Software Technol Pk,Ferozepur Rd, Lahore, Pakistan

Source:

NATURAL LANGUAGE ENGINEERING | 2022 / Vol. 28 / Issue 01

Funding:

National Science Foundation (USA);

Keywords:

Text data mining; Similarity; Machine learning; Phonetic encoding; TEXT NORMALIZATION; MODEL;

DOI:

10.1017/S1351324920000285

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Roman Urdu is an informal form of the Urdu language written in Roman script, which is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora, which includes a phonetic algorithm UrduPhone, a string matching component, a feature-based similarity function, and a clustering algorithm Lex-Var. UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu using Roman script. The similarity function incorporates various phonetic-based, string-based, and contextual features of words. The Lex-Var algorithm is a variant of the k-medoids clustering algorithm that groups lexical variations of words. It contains a similarity threshold to balance the number of clusters and their maximum similarity. The framework allows feature learning and optimization in addition to the use of predefined features and weights. We evaluate our framework extensively on four real-world datasets and show an F-measure gain of up to 15% from baseline methods. We also demonstrate the superiority of UrduPhone and Lex-Var in comparison to respective alternate algorithms in our clustering framework for the lexical normalization of Roman Urdu.
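The pipeline named in the abstract (a phonetic encoding, a weighted feature-based similarity, and a threshold-controlled k-medoids variant) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: `phonetic_key` is a toy encoding and not UrduPhone, `similarity` uses only two features with hand-picked weights and omits the contextual features, and `cluster` is a generic threshold-based medoid loop, not the authors' Lex-Var algorithm.

```python
from difflib import SequenceMatcher


def phonetic_key(word):
    """Toy phonetic key (hypothetical, NOT UrduPhone): keep the first
    letter, drop later vowels, and collapse repeated consonants."""
    word = word.lower()
    if not word:
        return ""
    key = [word[0]]
    for ch in word[1:]:
        if ch in "aeiouy":
            continue
        if ch != key[-1]:
            key.append(ch)
    return "".join(key)


def similarity(a, b, w_phonetic=0.5, w_string=0.5):
    """Weighted sum of a phonetic-match feature and a string-similarity
    feature. The weights are illustrative assumptions."""
    phonetic = 1.0 if phonetic_key(a) == phonetic_key(b) else 0.0
    string = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return w_phonetic * phonetic + w_string * string


def cluster(words, threshold=0.7, max_iter=10):
    """Threshold-based medoid clustering: a word joins its most similar
    medoid if the similarity reaches the threshold, otherwise it opens a
    new cluster. Medoids are then recomputed until they stop changing."""
    medoids = []
    clusters = {}
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for w in words:
            best = max(clusters, key=lambda m: similarity(w, m), default=None)
            if best is not None and similarity(w, best) >= threshold:
                clusters[best].append(w)
            else:
                clusters[w] = [w]  # w becomes the medoid of a new cluster
        # New medoid: the member with the highest total similarity to its cluster.
        new_medoids = [
            max(members, key=lambda c: sum(similarity(c, o) for o in members))
            for members in clusters.values()
            if members
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return [members for members in clusters.values() if members]


groups = cluster(["zindagi", "zindagee", "zindgi", "kitab", "kitaab"])
```

In this sketch a higher `threshold` yields more, tighter clusters and a lower one yields fewer, looser clusters, which mirrors the trade-off the abstract attributes to the similarity threshold.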

Pages: 93-123

Page count: 31

