Resource Creation for Training and Testing of Transliteration Systems for Indian Languages

Authors
Sowmya, V. B. [1]
Choudhury, Monojit [1]
Bali, Kalika [1]
Dasgupta, Tirthankar [2]
Basu, Anupam [2]
Affiliations
[1] Microsoft Res Lab India, Bangalore, Karnataka, India
[2] Soc Nat Language Technol Res, Kolkata, India
DOI
Not available
Chinese Library Classification number
H [Language, Writing];
Discipline classification code
05;
Abstract
Machine transliteration is used in a number of NLP applications, ranging from machine translation and information retrieval to input mechanisms for non-Roman scripts. Many popular Input Method Editors for Indian languages, such as Baraha, Akshara, and Quillpad, use back-transliteration as a mechanism to allow users to input text in a number of Indian languages. The lack of a standard dataset to evaluate these systems makes it difficult to make any meaningful comparison of their relative accuracies. In this paper, we describe the methodology for the creation of a dataset of approximately 2500 transliterated sentence pairs each in Bangla, Hindi and Telugu. The data was collected across three different modes from a total of 60 users. We believe that this dataset will prove useful not only for the evaluation and training of back-transliteration systems but also for the linguistic analysis of the process of transliterating Indian languages from native scripts to Roman.
Pages: 2902-2907
Number of pages: 6