Manipuri-English comparable corpus for cross-lingual studies

Cited by: 1
Authors
Laitonjam, Lenin [1, 2]
Singh, Sanasam Ranbir [1]
Affiliations
[1] Indian Inst Technol Guwahati, Dept Comp Sci & Engn, Gauhati, Assam, India
[2] Natl Inst Technol Mizoram, Dept Comp Sci & Engn, Aizawl, India
Keywords
Manipuri; Low-resource; Comparable corpus; Bilingual dictionary induction; Machine translation; GENERATION;
DOI
10.1007/s10579-021-09576-y
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification codes
081203; 0835
Abstract
This paper presents Mni-EnCC, a temporally aligned Manipuri-English comparable corpus, to facilitate cross-lingual studies between Manipuri and English. Mni-EnCC was created by collating text from two news sources in Manipur that publish on the internet, namely Sangai Express and Poknapham. Although both publishers issue Manipuri and English editions, the editions are not translations of each other. Almost all of the Manipuri editions are created using proprietary tools that generate text in customized, non-standard, non-Unicode encodings. We develop tools to transform the non-Unicode text into Unicode text to generate the Manipuri corpus. We then verify and time-align all the articles using a semi-automated process. Furthermore, the quality of Mni-EnCC is evaluated on two prominent cross-lingual tasks: bilingual dictionary induction and machine translation. Experimental observations provide encouraging results, suggesting that it is a suitable dataset for future cross-lingual studies on the Manipuri-English language pair. With the objective of promoting cross-lingual studies in Manipuri-English, we also plan to release the corpus and the supporting Unicode conversion tool.
Pages: 377-413
Page count: 37
Related papers
50 in total
  • [31] English-to-Korean Cross-Lingual Link Detection for Wikipedia
    Marigomen, Ralph
    Kang, In-Su
    U- AND E-SERVICE, SCIENCE AND TECHNOLOGY, 2011, 264 : 274 - 280
  • [32] English-Malayalam Cross-Lingual Information Retrieval - an experience
    Nikesh, P. L.
    Sumam, Mary Idicula
    David, Peter S.
    2008 IEEE INTERNATIONAL CONFERENCE ON ELECTRO/INFORMATION TECHNOLOGY, 2008, : 271 - 275
  • [33] Leveraging Synthetic Data for Improved Manipuri-English Code-Switched ASR
    Singh, Naorem Karline
    Madal, Wangkheimayum
    Devi, Chingakham Neeta
    Pangsatabam, Hoomexsun
    Chanu, Yambem Jina
    IEEE ACCESS, 2025, 13 : 25723 - 25740
  • [34] Reinforced Transformer with Cross-Lingual Distillation for Cross-Lingual Aspect Sentiment Classification
    Wu, Hanqian
    Wang, Zhike
    Qing, Feng
    Li, Shoushan
    ELECTRONICS, 2021, 10 (03) : 1 - 14
  • [35] ECCParaCorp: a cross-lingual parallel corpus towards cancer education, dissemination and application
    Ma, Hetong
    Yang, Feihong
    Ren, Jiansong
    Li, Ni
    Dai, Min
    Wang, Xuwen
    Fang, An
    Li, Jiao
    Qian, Qing
    He, Jie
    BMC MEDICAL INFORMATICS AND DECISION MAKING, 2020, 20 (Suppl 3)
  • [36] Cross-Lingual Blog Analysis by Cross-Lingual Comparison of Characteristic Terms and Blog Posts
    Nakasaki, Hiroyuki
    Kawaba, Mariko
    Utsuro, Takehito
    Fukuhara, Tomohiro
    Nakagawa, Hiroshi
    Kando, Noriko
    PROCEEDINGS OF THE SECOND INTERNATIONAL SYMPOSIUM ON UNIVERSAL COMMUNICATION, 2008, : 105 - +
  • [37] ECCParaCorp: a cross-lingual parallel corpus towards cancer education, dissemination and application
    Hetong Ma
    Feihong Yang
    Jiansong Ren
    Ni Li
    Min Dai
    Xuwen Wang
    An Fang
    Jiao Li
    Qing Qian
    Jie He
    BMC Medical Informatics and Decision Making, 20
  • [38] A fast forward approach to cross-lingual question answering for English and German
    Stroetgen, Robert
    Mandl, Thomas
    Schneider, Rene
    ACCESSING MULTILINGUAL INFORMATION REPOSITORIES, 2006, 4022 : 332 - 336
  • [39] A new semantically annotated corpus with syntactic-semantic and cross-lingual senses
    Rakho, Myriam
    Laporte, Eric
    Constant, Matthieu
    LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2012, : 597 - 600
  • [40] Cross-lingual semantic annotation of biomedical literature: experiments in Spanish and English
    Perez, Naiara
    Accuosto, Pablo
    Bravo, Alex
    Cuadros, Montse
    Martinez-Garcia, Eva
    Saggion, Horacio
    Rigau, German
    BIOINFORMATICS, 2020, 36 (06) : 1872 - 1880