Transliteration Characteristics in Romanized Assamese Language Social Media Text and Machine Transliteration

Authors
Baruah, Hemanta [1]
Singh, Sanasam Ranbir [1]
Sarmah, Priyankoo [1]
Affiliations
[1] Indian Inst Technol Guwahati, Gauhati 781039, Assam, India
Keywords
Transliteration; grapheme; phoneme; PBSMT; BiLSTM; attention; transformer;
DOI
10.1145/3639565
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
This article aims to understand the different transliteration behaviors found in Romanized Assamese text on social media. Assamese, which belongs to the Indo-Aryan language family, is one of the 22 scheduled languages of India. With the increasing popularity of social media in India and the common use of the English QWERTY keyboard, Indian users on social media express themselves in their native languages but in the Roman/Latin script. Unlike some other Asian languages (for example, Chinese, which has Pinyin), Indian languages do not have a common standard romanization convention for writing on social media platforms. Assamese and English also have very different orthographies. Thus, considering both the orthographic and phonemic characteristics of the language, this study tries to explain how Assamese vowels, vowel diacritics, and consonants are represented in Roman transliterated form. We collected a dataset of romanized Assamese social media texts from three popular social media sites (Facebook, YouTube, and X, formerly known as Twitter) and manually labeled the texts with their native Assamese script. We also compare the transliterated Assamese social media texts with six different Assamese romanization schemes; the comparison reflects how Assamese users on social media do not adhere to any fixed romanization scheme. We have built three separate character-level transliteration models from our dataset: (1) a traditional phrase-based statistical machine transliteration (PBSMT) model, and two neural transliteration models, namely (2) a BiLSTM neural seq2seq model with attention and (3) a neural transformer model. A thorough error analysis has been performed on the transliteration results obtained from the three state-of-the-art models mentioned above. This may help to build a more robust machine transliteration system for the Assamese social media domain in the future.
Finally, an attention analysis experiment is also carried out with the help of attention weight scores taken from the character-level BiLSTM neural seq2seq transliteration model built from our dataset.
Pages: 36