Parallel sentence generation from comparable corpora for improved SMT

Cited by: 23
Authors
Rauf, Sadaf Abdul [1]
Schwenk, Holger [1]
Affiliations
[1] Univ Le Mans, LIUM, Le Mans 9, France
Keywords
Statistical machine translation (SMT); Comparable corpus; Non-parallel corpus; Information retrieval (IR); WER; TER; TERp; Arabic-English; French-English; Sentence tail removal;
DOI
10.1007/s10590-011-9114-9
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
A parallel corpus is an essential resource for statistical machine translation (SMT) but is often not available in the required amounts for all domains and languages. An approach is presented here which aims at producing parallel corpora from available comparable corpora. An SMT system is used to translate the source-language part of a comparable corpus, and the translations are used as queries to conduct information retrieval (IR) from the target-language side of the comparable corpus. Simple filters are then used to score the SMT output against the IR-returned sentence, with the filter score defining the degree of similarity between the two. Using SMT system output gives the benefit of trying to correct one of the common errors by sentence tail removal. The approach was applied to Arabic-English and French-English systems using comparable news corpora, and considerable improvements were achieved in the BLEU score. We show that our approach is independent of the quality of the SMT system used to make the queries, strengthening the claim of applicability of the approach for languages and domains with limited parallel corpora available to start with. We compare our approach with one of the earlier approaches and show that our approach is easier to implement and gives equally good improvements.
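
The filtering step described in the abstract can be illustrated with a minimal sketch. It assumes a word error rate (WER) filter, one of the measures named in the keywords, computed between the SMT translation and the IR-returned sentence and normalized by the length of the IR-returned sentence. The function names, the lowercased whitespace tokenization, and the `max_wer` threshold of 0.5 are hypothetical choices made for illustration; the record does not specify the authors' actual settings.

```python
def word_edit_distance(hyp, ref):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # drop a hypothesis token
                            curr[j - 1] + 1,      # add a reference token
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def wer(hypothesis, reference):
    """Word error rate of hypothesis against reference.

    Assumption: lowercased whitespace tokenization, normalized by
    the reference length.
    """
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not ref:
        return 0.0 if not hyp else float("inf")
    return word_edit_distance(hyp, ref) / len(ref)


def filter_pairs(candidates, max_wer=0.5):
    """Keep (source, retrieved, score) triples whose WER is at most max_wer.

    candidates: iterable of (source_sentence, smt_translation, ir_sentence).
    max_wer is a hypothetical threshold, not a value from the paper.
    """
    kept = []
    for source, translation, retrieved in candidates:
        score = wer(translation, retrieved)
        if score <= max_wer:
            kept.append((source, retrieved, score))
    return kept
```

In this sketch a kept pair joins the original source sentence with the retrieved target-language sentence, so the SMT output serves only as the query and the scoring reference. A lower threshold would presumably trade corpus size for cleaner pairs.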
Pages: 341-375
Page count: 35
Related papers
50 in total
  • [1] Parallel Sentence Alignment from Biomedical Comparable Corpora
    Cardon, Remi
    Grabar, Natalia
    [J]. DIGITAL PERSONALIZED HEALTH AND MEDICINE, 2020, 270 : 362 - 366
  • [2] Improved machine translation performance via parallel sentence extraction from comparable corpora
    Munteanu, DS
    Fraser, A
    Marcu, D
    [J]. HLT-NAACL 2004: HUMAN LANGUAGE TECHNOLOGY CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, PROCEEDINGS OF THE MAIN CONFERENCE, 2004, : 265 - 272
  • [3] PEXACC: A Parallel Sentence Mining Algorithm from Comparable Corpora
    Ion, Radu
    [J]. LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2012, : 2181 - 2188
  • [4] Parallel Sentence Extraction from Comparable Corpora with Neural Network Features
    Chu, Chenhui
    Dabre, Raj
    Kurohashi, Sadao
    [J]. LREC 2016 - TENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2016, : 2931 - 2935
  • [5] A Multilingual Dataset for Evaluating Parallel Sentence Extraction from Comparable Corpora
    Zweigenbaum, Pierre
    Sharoff, Serge
    Rapp, Reinhard
    [J]. PROCEEDINGS OF THE ELEVENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION (LREC 2018), 2018, : 3828 - 3833
  • [6] Sentence alignment for monolingual comparable corpora
    Barzilay, R
    Elhadad, N
    [J]. PROCEEDINGS OF THE 2003 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, 2003, : 25 - 32
  • [7] Extracting Parallel Phrases from Comparable Corpora
    Zhang, Jiexin
    Cao, Hailong
    Zhao, Tiejun
    [J]. PROCEEDINGS OF THE 2014 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP 2014), 2014, : 166 - 169
  • [8] Building subject-aligned comparable corpora and mining it for truly parallel sentence pairs
    Wolk, Krzysztof
    Marasek, Krzysztof
    [J]. INTERNATIONAL WORKSHOP ON INNOVATIONS IN INFORMATION AND COMMUNICATION SCIENCE AND TECHNOLOGY, IICST 2014, 2014, 18 : 126 - 132
  • [9] Integrated Parallel Sentence and Fragment Extraction from Comparable Corpora: A Case Study on Chinese-Japanese Wikipedia
    Chu, Chenhui
    Nakazawa, Toshiaki
    Kurohashi, Sadao
    [J]. ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2016, 15 (02)
  • [10] A Holistic Approach to Bilingual Sentence Fragment Extraction from Comparable Corpora
    Khademian, Mahdi
    Taghipour, Kaveh
    Mansour, Saab
    Khadivi, Shahram
    [J]. LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2012, : 4073 - 4079