Manipuri-English comparable corpus for cross-lingual studies

Cited by: 1
Authors
Laitonjam, Lenin [1, 2]
Singh, Sanasam Ranbir [1]
Affiliations
[1] Indian Inst Technol Guwahati, Dept Comp Sci & Engn, Gauhati, Assam, India
[2] Natl Inst Technol Mizoram, Dept Comp Sci & Engn, Aizawl, India
Keywords
Manipuri; Low-resource; Comparable corpus; Bilingual dictionary induction; Machine translation; GENERATION
DOI
10.1007/s10579-021-09576-y
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
This paper presents Mni-EnCC, a temporally aligned Manipuri-English comparable corpus, to facilitate cross-lingual studies between Manipuri and English. Mni-EnCC was created by collating text from two news sources in Manipur that publish online, namely Sangai Express and Poknapham. Although both publishers issue Manipuri and English editions, the two editions are not translations of each other. Almost all of the Manipuri editions are created using proprietary tools that generate text in customized, non-standard, non-Unicode encodings. We develop tools to transform the non-Unicode text into Unicode text to generate the Manipuri corpus. We then verify and time-align all the articles using a semi-automated process. Furthermore, the quality of Mni-EnCC is evaluated on two prominent cross-lingual tasks: bilingual dictionary induction and machine translation. The experimental results are encouraging and suggest that the corpus is a suitable dataset for future cross-lingual studies on the Manipuri-English language pair. With the objective of promoting cross-lingual studies in Manipuri-English, we also plan to release the corpus and the supporting Unicode conversion tool.
Pages: 377-413
Number of pages: 37
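The abstract mentions that articles are time-aligned but does not say how. As a rough illustration only, the sketch below links each Manipuri publication date to the English articles published within a small window of days. The `Article` record, the `align_by_date` function, the `window_days` parameter, and the sample data are all hypothetical and are not taken from the paper; the authors' semi-automated process may differ substantially.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Article:
    """Hypothetical article record; the paper's actual schema is not given."""
    doc_id: str
    lang: str        # "mni" (Manipuri) or "en" (English)
    published: date


def align_by_date(mni_articles, en_articles, window_days=0):
    """Map each Manipuri publication date to a pair
    (Manipuri articles on that date, English articles within +/- window_days).
    Dates with no English candidates are omitted."""
    en_by_date = defaultdict(list)
    for art in en_articles:
        en_by_date[art.published].append(art)

    mni_by_date = defaultdict(list)
    for art in mni_articles:
        mni_by_date[art.published].append(art)

    aligned = {}
    for day in sorted(mni_by_date):
        candidates = []
        for offset in range(-window_days, window_days + 1):
            # .get() avoids creating empty buckets in the defaultdict
            candidates.extend(en_by_date.get(day + timedelta(days=offset), []))
        if candidates:
            aligned[day] = (mni_by_date[day], candidates)
    return aligned


# Made-up sample data, for illustration only.
mni = [Article("m1", "mni", date(2020, 1, 1)),
       Article("m2", "mni", date(2020, 1, 3))]
en = [Article("e1", "en", date(2020, 1, 1)),
      Article("e2", "en", date(2020, 1, 2))]

same_day = align_by_date(mni, en, window_days=0)
one_day_window = align_by_date(mni, en, window_days=1)
```

A window of zero keeps only same-day pairs, while a wider window tolerates editions that report the same event a day apart. Either way, the result is a set of date-level candidate groups that would still need verification, in line with the semi-automated process the abstract describes.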