Parallel sentence generation from comparable corpora for improved SMT

Cited by: 23
Authors
Rauf, Sadaf Abdul [1]
Schwenk, Holger [1]
Affiliations
[1] Univ Le Mans, LIUM, Le Mans 9, France
Keywords
Statistical machine translation (SMT); Comparable corpus; Non-parallel corpus; Information retrieval (IR); WER; TER; TERp; Arabic-English; French-English; Sentence tail removal;
DOI
10.1007/s10590-011-9114-9
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
A parallel corpus is an essential resource for statistical machine translation (SMT) but is often not available in the required amounts for all domains and languages. An approach is presented here that aims at producing parallel corpora from available comparable corpora. An SMT system is used to translate the source-language part of a comparable corpus, and the translations are used as queries to conduct information retrieval (IR) from the target-language side of the comparable corpus. Simple filters are then used to score the SMT output against the IR-returned sentence, with the filter score defining the degree of similarity between the two. Using SMT system output also makes it possible to correct one of the common errors by sentence tail removal. The approach was applied to Arabic-English and French-English systems using comparable news corpora, and considerable improvements were achieved in the BLEU score. We show that our approach is independent of the quality of the SMT system used to make the queries, strengthening the claim that the approach is applicable to languages and domains with limited parallel corpora available to start with. We compare our approach with one of the earlier approaches and show that our approach is easier to implement and gives equally good improvements.
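The filtering step described in the abstract can be illustrated with a minimal sketch. It assumes a word error rate (WER) filter, one of the measures listed in the keywords, computed between the SMT output and the IR-returned sentence. The function names, the lowercasing and whitespace tokenization, the choice of the IR sentence as the reference, and the threshold of 0.5 are all illustrative assumptions, not details taken from the paper.

```python
from typing import List, Tuple


def word_edit_distance(hyp: List[str], ref: List[str]) -> int:
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # extra word in hyp
                            curr[j - 1] + 1,      # missing word in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(hyp: str, ref: str) -> float:
    """WER of hyp against ref (assumption: lowercased whitespace tokens)."""
    h, r = hyp.lower().split(), ref.lower().split()
    if not r:
        return 0.0 if not h else float("inf")
    return word_edit_distance(h, r) / len(r)


def filter_pairs(
    candidates: List[Tuple[str, str, str]],
    max_wer: float = 0.5,  # hypothetical threshold, not from the paper
) -> List[Tuple[str, str, float]]:
    """Keep (source, IR sentence, score) when the SMT output is close enough.

    Each candidate is (source sentence, SMT translation, IR-returned sentence).
    """
    kept = []
    for src, smt, ir in candidates:
        score = wer(smt, ir)
        if score <= max_wer:
            kept.append((src, ir, score))
    return kept
```

A lower threshold keeps fewer but closer pairs. The sketch does not cover the retrieval step, TER or TERp scoring, or sentence tail removal.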
Pages: 341-375
Number of pages: 35