Manipuri-English comparable corpus for cross-lingual studies

Cited by: 1
Authors
Laitonjam, Lenin [1, 2]
Singh, Sanasam Ranbir [1]
Affiliations
[1] Indian Inst Technol Guwahati, Dept Comp Sci & Engn, Gauhati, Assam, India
[2] Natl Inst Technol Mizoram, Dept Comp Sci & Engn, Aizawl, India
Keywords
Manipuri; Low-resource; Comparable corpus; Bilingual dictionary induction; Machine translation; GENERATION
DOI
10.1007/s10579-021-09576-y
Chinese Library Classification number
TP39 [Computer applications]
Subject classification codes
081203; 0835
Abstract
This paper presents Mni-EnCC, a temporally aligned Manipuri-English comparable corpus, to facilitate cross-lingual studies between Manipuri and English. Mni-EnCC has been created by collating text from two publicly available online news sources in Manipur, namely Sangai Express and Poknapham. Although both publishers issue Manipuri and English editions, the editions are not translations of each other. Almost all of the Manipuri editions are created using proprietary tools that generate text in customized, non-standard, non-Unicode encodings. We develop tools to transform the non-Unicode text into Unicode text to generate the Manipuri corpus. We then verify and time-align all the articles using a semi-automated process. Furthermore, the quality of Mni-EnCC is evaluated on two prominent cross-lingual tasks: bilingual dictionary induction and machine translation. Experimental observations provide encouraging results, suggesting that Mni-EnCC is a suitable dataset for future cross-lingual studies on the Manipuri-English language pair. With the objective of promoting cross-lingual studies in Manipuri-English, we also plan to release the corpus and the supporting Unicode conversion tool.
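The abstract does not say how the time alignment is carried out beyond calling it semi-automated. As a rough illustration only, the sketch below shows one way an automated first pass could pair Manipuri and English articles by publication date before manual verification. The function name `align_by_date`, the `window_days` parameter, the article identifiers, and the dates are all hypothetical and do not come from the paper.

```python
from collections import defaultdict
from datetime import date, timedelta


def align_by_date(mni_articles, en_articles, window_days=0):
    """Map each Manipuri article id to English article ids published
    within +/- window_days of it (hypothetical first-pass alignment)."""
    # Index the English articles by publication date for quick lookup.
    en_by_date = defaultdict(list)
    for art_id, pub_date in en_articles:
        en_by_date[pub_date].append(art_id)

    aligned = {}
    for art_id, pub_date in mni_articles:
        candidates = []
        # Scan the date window from earliest to latest day.
        for offset in range(-window_days, window_days + 1):
            day = pub_date + timedelta(days=offset)
            candidates.extend(en_by_date.get(day, []))
        aligned[art_id] = candidates
    return aligned


# Made-up identifiers and dates, for illustration only.
mni = [("mni-001", date(2020, 1, 5)), ("mni-002", date(2020, 1, 7))]
en = [
    ("en-101", date(2020, 1, 5)),
    ("en-102", date(2020, 1, 6)),
    ("en-103", date(2020, 1, 9)),
]

print(align_by_date(mni, en, window_days=1))
```

In a semi-automated workflow, candidate sets like these would presumably be passed to a human reviewer, who confirms or discards each pairing.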
Pages: 377-413
Page count: 37