Document Alignment for Generation of English-Punjabi Comparable Corpora from Wikipedia

Cited by: 2
Authors
Goyal, Vishal [1]
Kumar, Ajit [2]
Lehal, Manpreet Singh [1]
Affiliations
[1] Punjabi Univ, Patiala, Punjab, India
[2] Multani Mal Modi Coll, Patiala, Punjab, India
Keywords
Comparable Corpora; Document Alignment; NLP; SMT; Wikipedia;
DOI
10.4018/IJEA.2020010104
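Chinese Library Classification number
G25 [Library Science, Librarianship]; G35 [Information Science, Information Work]
Discipline classification code
1205; 120501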
Abstract
Comparable corpora serve as an alternative to parallel corpora for languages in which parallel corpora are scarce. Models trained on comparable corpora perform less well than those trained on parallel corpora, but comparable corpora still contribute substantially to machine translation. In this article, the authors explore Wikipedia as a potential source and describe the process of aligning documents, which will then be used to extract parallel data. The parallel data thus extracted will help to improve the performance of statistical machine translation.
Pages: 42 - 51
Page count: 10
Related papers
19 in total
  • [1] Building English - Punjabi Aligned Parallel Corpora of Nouns from Comparable Corpora
    Kaur, Dilshad
    Singh, Satwinder
    [J]. APPLIED COMPUTER SYSTEMS, 2023, 28 (02) : 245 - 251
  • [2] Document and Sentence Alignment in Comparable Corpora Using Bipartite Graph Matching
    Rahimi, Zeinab
    Taghipour, Kaveh
    Khadivi, Shahram
    Afhami, Nasim
    [J]. 2012 SIXTH INTERNATIONAL SYMPOSIUM ON TELECOMMUNICATIONS (IST), 2012, : 817 - 821
  • [3] Bilingual term alignment from comparable corpora in English discharge summary and Chinese discharge summary
    Yan Xu
    Luoxin Chen
    Junsheng Wei
    Sophia Ananiadou
    Yubo Fan
    Yi Qian
    Eric I-Chao Chang
    Junichi Tsujii
    [J]. BMC Bioinformatics, 16
  • [4] Bilingual term alignment from comparable corpora in English discharge summary and Chinese discharge summary
    Xu, Yan
    Chen, Luoxin
    Wei, Junsheng
    Ananiadou, Sophia
    Fan, Yubo
    Qian, Yi
    Chang, Eric I-Chao
    Tsujii, Junichi
    [J]. BMC BIOINFORMATICS, 2015, 16
  • [5] Parallel Sentence Alignment from Biomedical Comparable Corpora
    Cardon, Remi
    Grabar, Natalia
    [J]. DIGITAL PERSONALIZED HEALTH AND MEDICINE, 2020, 270 : 362 - 366
  • [6] Cross-lingual document similarity estimation and dictionary generation with comparable corpora
    Tadej Štajner
    Dunja Mladenić
    [J]. Knowledge and Information Systems, 2019, 58 : 729 - 743
  • [7] Cross-lingual document similarity estimation and dictionary generation with comparable corpora
    Stajner, Tadej
    Mladenic, Dunja
    [J]. KNOWLEDGE AND INFORMATION SYSTEMS, 2019, 58 (03) : 729 - 743
  • [8] French-English terminology extraction from comparable corpora
    Daille, B
    Morin, E
    [J]. NATURAL LANGUAGE PROCESSING - IJCNLP 2005, PROCEEDINGS, 2005, 3651 : 707 - 718
  • [9] Between Comparable and Parallel: English-Czech Corpus from Wikipedia
    Stromajerova, Adela
    Baisa, Vit
    Blahus, Marek
    [J]. RECENT ADVANCES IN SLAVONIC NATURAL LANGUAGE PROCESSING (RASLAN 2016), 2016, : 3 - 8
  • [10] Parallel sentence generation from comparable corpora for improved SMT
    Rauf, Sadaf Abdul
    Schwenk, Holger
    [J]. MACHINE TRANSLATION, 2011, 25 (04) : 341 - 375