A clustering framework for lexical normalization of Roman Urdu

Cited by: 2
Authors
Khan, Abdul Rafae [1, 2]
Karim, Asim [3 ]
Sajjad, Hassan [4 ]
Kamiran, Faisal [5 ]
Xu, Jia [1, 2]
Affiliations
[1] Stevens Inst Technol, Hoboken, NJ 07030 USA
[2] CUNY, Comp Sci Dept, Grad Ctr, 365 5th Ave, New York, NY 10016 USA
[3] Lahore Univ Management Sci, Lahore 54792, Pakistan
[4] Hamad Bin Khalifa Univ, Qatar Comp Res Inst, Doha, Qatar
[5] Informat Technol Univ, Arfa Software Technol Pk, Ferozepur Rd, Lahore, Pakistan
Funding
National Science Foundation (USA)
Keywords
Text data mining; Similarity; Machine learning; Phonetic encoding; Text normalization; Model
DOI
10.1017/S1351324920000285
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Roman Urdu is an informal form of the Urdu language written in Roman script, which is widely used in South Asia for online textual content. It lacks standard spelling and hence poses several normalization challenges during automatic language processing. In this article, we present a feature-based clustering framework for the lexical normalization of Roman Urdu corpora, which includes a phonetic algorithm (UrduPhone), a string matching component, a feature-based similarity function, and a clustering algorithm (Lex-Var). UrduPhone encodes Roman Urdu strings to their pronunciation-based representations. The string matching component handles character-level variations that occur when writing Urdu using Roman script. The similarity function incorporates various phonetic-based, string-based, and contextual features of words. The Lex-Var algorithm is a variant of the k-medoids clustering algorithm that groups lexical variations of words. It contains a similarity threshold to balance the number of clusters and their maximum similarity. The framework allows feature learning and optimization in addition to the use of predefined features and weights. We evaluate our framework extensively on four real-world datasets and show an F-measure gain of up to 15% over baseline methods. We also demonstrate the superiority of UrduPhone and Lex-Var over their respective alternative algorithms in our clustering framework for the lexical normalization of Roman Urdu.
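The pipeline the abstract outlines (phonetic encoding, a weighted similarity function, and a threshold-controlled medoid clustering) can be illustrated with a small Python sketch. This is not the authors' UrduPhone or Lex-Var: `toy_phonetic_key`, `similarity`, `cluster_variants`, the weights, the threshold value, and the sample spellings are all hypothetical stand-ins chosen for illustration, and the contextual features mentioned in the abstract are omitted.

```python
from difflib import SequenceMatcher


def toy_phonetic_key(word: str) -> str:
    """Hypothetical phonetic key (NOT UrduPhone): keep the first letter,
    drop later vowels and the weak letters h, w, y, collapse repeats."""
    word = word.lower()
    if not word:
        return ""
    key = [word[0]]
    for ch in word[1:]:
        if ch in "aeiouhwy":
            continue
        if ch != key[-1]:
            key.append(ch)
    return "".join(key)


def similarity(a: str, b: str, w_phonetic: float = 0.5, w_string: float = 0.5) -> float:
    """Weighted mix of a phonetic feature and a string feature.
    The weights are assumed values, not learned ones."""
    phonetic = 1.0 if toy_phonetic_key(a) == toy_phonetic_key(b) else 0.0
    string = SequenceMatcher(None, a, b).ratio()
    return w_phonetic * phonetic + w_string * string


def cluster_variants(words, threshold=0.7, max_iter=10):
    """Medoid-style clustering with a similarity threshold (assumes unique words).
    A word joins its most similar medoid if the similarity reaches the
    threshold; otherwise it opens a new cluster with itself as medoid."""
    medoids = []
    clusters = {}
    for _ in range(max_iter):
        clusters = {m: [] for m in medoids}
        for w in words:
            best = max(clusters, key=lambda m: similarity(w, m), default=None)
            if best is not None and similarity(w, best) >= threshold:
                clusters[best].append(w)
            else:
                clusters[w] = [w]
        # Re-pick each medoid as the member most similar to its cluster mates.
        new_medoids = [
            max(members, key=lambda c: sum(similarity(c, o) for o in members))
            for members in clusters.values()
            if members
        ]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids
    return [members for members in clusters.values() if members]


# Illustrative spelling variants (assumed examples, not taken from the paper's datasets).
words = ["kya", "kia", "kiya", "nahi", "nahin", "nai"]
clusters = cluster_variants(words)
print(clusters)
```

In this sketch, raising `threshold` tends to produce more, tighter clusters, while lowering it merges more spellings into fewer clusters, which is the kind of trade-off the abstract attributes to the similarity threshold.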
Pages: 93-123
Page count: 31
Related papers
50 in total
  • [21] Lexical Normalization of Spanish Tweets
    Ceron-Guzman, Jhon Adrian
    Leon-Guzman, Elizabeth
    PROCEEDINGS OF THE 25TH INTERNATIONAL CONFERENCE ON WORLD WIDE WEB (WWW'16 COMPANION), 2016, : 605 - 610
  • [22] A Weakly Supervised Data Labeling Framework for Machine Lexical Normalization in Vietnamese Social Media
    Dung Ha Nguyen
    Anh Thi Hoang Nguyen
    Kiet Van Nguyen
    Cognitive Computation, 2025, 17 (1)
  • [23] An Acoustic Investigation of Primary and Secondary Lexical Stress of Urdu
    Ul Ain, Qurrat
    Mahmood, Muhammad Asim
    Raza, Syed Muhammad Muslim
    Zakir, Anam
    GEMA ONLINE JOURNAL OF LANGUAGE STUDIES, 2023, 23 (01): : 74 - 92
  • [24] RUSAS: Roman Urdu Sentiment Analysis System
    Jawad, Kazim
    Ahmad, Muhammad
    Alvi, Majdah
    Alvi, Muhammad Bux
    CMC-COMPUTERS MATERIALS & CONTINUA, 2024, 79 (01): : 1463 - 1480
  • [25] RUTUT: Roman Urdu to Urdu Translator Based on Character Substitution Rules and Unicode Mapping
    Shahroz, Mobeen
    Mushtaq, Muhammad Faheem
    Mehmood, Arif
    Ullah, Saleem
    Choi, Gyu Sang
    IEEE ACCESS, 2020, 8 : 189823 - 189841
  • [26] Multilingual Detection of Cyberbullying in Mixed Urdu, Roman Urdu, and English Social Media Conversations
    Razi, Fakhra
    Ejaz, Naveed
    IEEE ACCESS, 2024, 12 : 105201 - 105210
  • [27] SPEAKER NORMALIZATION IN PERCEPTION OF LEXICAL TONE
    LEATHER, J
    JOURNAL OF PHONETICS, 1983, 11 (04) : 373 - 382
  • [28] Is vowel normalization independent of lexical processing?
    Mitterer, Holger
    PHONETICA, 2006, 63 (04) : 209 - 229
  • [29] Lexical Normalization for Social Media Text
    Han, Bo
    Cook, Paul
    Baldwin, Timothy
    ACM TRANSACTIONS ON INTELLIGENT SYSTEMS AND TECHNOLOGY, 2013, 4 (01)
  • [30] Tools for nominalization: An alternative for lexical normalization
    Insaurriaga Gonzalez, Marco Antonio
    de Lima, Vera Lucia Strube
    de Lima, Jose Valdeni
    COMPUTATIONAL PROCESSING OF THE PORTUGUESE LANGUAGE, PROCEEDINGS, 2006, 3960 : 100 - 109