Manipuri-English comparable corpus for cross-lingual studies

Cited by: 1
Authors
Laitonjam, Lenin [1, 2]
Singh, Sanasam Ranbir [1]
Affiliations
[1] Indian Inst Technol Guwahati, Dept Comp Sci & Engn, Gauhati, Assam, India
[2] Natl Inst Technol Mizoram, Dept Comp Sci & Engn, Aizawl, India
Keywords
Manipuri; Low-resource; Comparable corpus; Bilingual dictionary induction; Machine translation; GENERATION;
DOI
10.1007/s10579-021-09576-y
Chinese Library Classification number
TP39 [Computer applications];
Subject classification code
081203 ; 0835 ;
Abstract
This paper presents Mni-EnCC, a temporally aligned Manipuri-English comparable corpus, to facilitate cross-lingual studies between Manipuri and English. Mni-EnCC was created by collating text from two publicly available online news sources in Manipur, namely Sangai Express and Poknapham. Although both publishers issue Manipuri and English editions, the two editions are not translations of each other. Almost all of the Manipuri editions are produced with proprietary tools that generate text in customized, non-standard, non-Unicode encodings. We develop tools to transform the non-Unicode text into Unicode text to generate the Manipuri corpus. We then verify and time-align all the articles using a semi-automated process. Furthermore, the quality of Mni-EnCC is evaluated on two prominent cross-lingual tasks: bilingual dictionary induction and machine translation. Experimental observations provide encouraging results, making it a suitable dataset for future cross-lingual studies on the Manipuri-English language pair. With the objective of promoting cross-lingual studies in Manipuri-English, we also plan to release the corpus and the supporting Unicode conversion tool.
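As an illustration of what temporal alignment of a comparable corpus can look like, the sketch below pairs each Manipuri article with the English articles published within a small date window. The abstract does not describe the actual alignment procedure, so everything here is an assumption made for illustration: the function name align_by_date, the window_days parameter, the (id, date) tuple layout, and the sample article identifiers are all hypothetical and are not taken from Mni-EnCC.

```python
from collections import defaultdict
from datetime import date, timedelta


def align_by_date(mni_articles, en_articles, window_days=1):
    """Map each Manipuri article id to English article ids whose
    publication date falls within +/- window_days (hypothetical scheme)."""
    # Index English articles by publication date for constant-time lookup.
    en_by_date = defaultdict(list)
    for art_id, pub_date in en_articles:
        en_by_date[pub_date].append(art_id)

    aligned = {}
    for art_id, pub_date in mni_articles:
        candidates = []
        # Scan the window from the earliest to the latest admissible day.
        for offset in range(-window_days, window_days + 1):
            day = pub_date + timedelta(days=offset)
            candidates.extend(en_by_date.get(day, []))
        aligned[art_id] = candidates
    return aligned


# Made-up sample data, for illustration only.
mni = [("mni-001", date(2020, 1, 5)), ("mni-002", date(2020, 1, 9))]
en = [
    ("en-101", date(2020, 1, 4)),
    ("en-102", date(2020, 1, 5)),
    ("en-103", date(2020, 1, 20)),
]
pairs = align_by_date(mni, en)
```

With this sample data, "mni-001" is paired with the two English articles dated within one day of it, while "mni-002" has no candidates. In a semi-automated workflow such as the one the abstract mentions, candidate lists like these would presumably still need manual verification.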
Pages: 377-413
Page count: 37
Source: Language Resources and Evaluation, 2023, 57: 377-413