A clustering framework for lexical normalization of Roman Urdu

Cited by: 2

Authors:

Khan, Abdul Rafae [1,2]

Karim, Asim [3]

Sajjad, Hassan [4]

Kamiran, Faisal [5]

Xu, Jia [1,2]

Affiliations:

[1] Stevens Inst Technol, Hoboken, NJ 07030 USA

[2] CUNY, Comp Sci Dept, Grad Ctr, 365 5th Ave, New York, NY 10016 USA

[3] Lahore Univ Management Sci, Lahore 54792, Pakistan

[4] Hamad Bin Khalifa Univ, Qatar Comp Res Inst, Doha, Qatar

[5] Informat Technol Univ, Arfa Software Technol Pk,Ferozepur Rd, Lahore, Pakistan

Source:

NATURAL LANGUAGE ENGINEERING | 2022 / Vol. 28 / Issue 01

Funding:

U.S. National Science Foundation;

Keywords:

Text data mining; Similarity; Machine learning; Phonetic encoding; TEXT NORMALIZATION; MODEL;

DOI:

10.1017/S1351324920000285

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification code:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Roman Urdu is an informal form of the Urdu language written in Roman script, which is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora, which includes a phonetic algorithm UrduPhone, a string matching component, a feature-based similarity function, and a clustering algorithm Lex-Var. UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu using Roman script. The similarity function incorporates various phonetic-based, string-based, and contextual features of words. The Lex-Var algorithm is a variant of the k-medoids clustering algorithm that groups lexical variations of words. It contains a similarity threshold to balance the number of clusters and their maximum similarity. The framework allows feature learning and optimization in addition to the use of predefined features and weights. We evaluate our framework extensively on four real-world datasets and show an F-measure gain of up to 15% from baseline methods. We also demonstrate the superiority of UrduPhone and Lex-Var in comparison to respective alternate algorithms in our clustering framework for the lexical normalization of Roman Urdu.
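
To make the pipeline in the abstract more concrete, here is a minimal Python sketch of the general idea: a phonetic key and a string similarity are combined into a weighted score, and words are grouped around medoids subject to a similarity threshold. It is not the authors' implementation. `toy_phonetic_key` is a hypothetical stand-in and does not reproduce UrduPhone's encoding rules, `cluster_variants` is a simplified threshold-based k-medoids-style loop rather than Lex-Var itself, contextual features are omitted, and the weights, the threshold, and the example spellings are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher

VOWEL_LIKE = set("aeiouy")  # assumption: treat 'y' like a vowel in this toy encoder


def toy_phonetic_key(word: str) -> str:
    """Hypothetical phonetic key (NOT UrduPhone): keep the first letter,
    drop later vowel-like letters, and collapse repeated consonants."""
    word = word.lower()
    if not word:
        return ""
    key = [word[0]]
    for ch in word[1:]:
        if ch in VOWEL_LIKE:
            continue
        if ch != key[-1]:
            key.append(ch)
    return "".join(key)


def similarity(a: str, b: str, w_phonetic: float = 0.5, w_string: float = 0.5) -> float:
    """Weighted sum of a phonetic-match feature and a string-similarity feature.
    The weights are illustrative; contextual features are left out."""
    phonetic = 1.0 if toy_phonetic_key(a) == toy_phonetic_key(b) else 0.0
    string = SequenceMatcher(None, a, b).ratio()
    return w_phonetic * phonetic + w_string * string


def cluster_variants(words, threshold=0.7, max_iter=10):
    """Simplified k-medoids-style clustering with a similarity threshold.
    A word joins its most similar medoid if the score reaches the threshold;
    otherwise it starts a new cluster. Assumes the words are unique."""
    medoids = []
    clusters = {}
    for _ in range(max_iter):
        # Assignment step: start each pass with empty clusters for the current medoids.
        clusters = {m: [] for m in medoids}
        for w in words:
            best = max(clusters, key=lambda m: similarity(w, m), default=None)
            if best is not None and similarity(w, best) >= threshold:
                clusters[best].append(w)
            else:
                clusters[w] = [w]
        # Update step: the new medoid is the member most similar to the others.
        new_medoids = [
            max(members, key=lambda c: sum(similarity(c, o) for o in members))
            for members in clusters.values()
            if members
        ]
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return list(clusters.values())


# Illustrative spellings only; they are not taken from the paper's datasets.
words = ["kya", "kia", "kyaa", "nahi", "nahee"]
for group in cluster_variants(words):
    print(group)
```

Under these toy settings the three spellings of "kya" fall into one group and the two spellings of "nahi" into another. Raising the threshold produces more, tighter clusters and lowering it merges more words, which mirrors the trade-off between the number of clusters and their similarity that the abstract attributes to the threshold.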


Pages: 93-123

Page count: 31
