Evaluating the English-Turkish parallel treebank for machine translation

Cited by: 0
Authors
Gorgun, Onur [1, 2]
Yildiz, Olcay Taner [3]
Affiliations
[1] Isik Univ, Fac Engn & Nat Sci, Dept Comp Engn, Istanbul, Turkey
[2] Nokia, Res & Dev Ctr, Istanbul, Turkey
[3] Ozyegin Univ, Fac Engn, Comp Sci Dept, Istanbul, Turkey
Keywords
Parallel treebank; parallel corpora; Turkish; English; syntax-based;
DOI
10.3906/elk-2102-57
Chinese Library Classification
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
This study extends our initial efforts to build an English-Turkish parallel treebank corpus for statistical machine translation. We manually generated parallel trees for about 17K sentences selected from the Penn Treebank corpus. The English sentences range in length from 15 to 50 tokens, including punctuation. We constrained the translation of trees to (i) reordering of leaf nodes based on suffixation rules in Turkish and (ii) gloss replacement, with the aim of mimicking a human annotator's behavior in a real translation task. To bridge the morphological and syntactic gap between the two languages, we perform morphological annotation and disambiguation. We also apply our heuristics to create the Nokia English-Turkish Treebank (NTB), which addresses technical document translation tasks and includes 8.3K sentences of varying lengths. We validate the corpus both extrinsically and intrinsically and report perplexity analysis and translation task results. The results indicate that our heuristics yield promising perplexity values and are suitable for translation tasks in terms of BLEU scores.
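The two constraints named in the abstract, reordering and gloss replacement, can be illustrated with a minimal sketch. It is not the authors' tooling: the `Node` class, the helper names, the toy sentence, and the `*NONE*` marker for an English token with no Turkish counterpart are all assumptions made for illustration, and reordering is modeled here simply as a permutation of sibling nodes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """A constituency-tree node: leaves carry a token, inner nodes carry children."""
    label: str
    token: Optional[str] = None
    children: List["Node"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        return not self.children


def reorder(node: Node, order: List[int]) -> None:
    """Permute the children of one node in place; no node is added or removed."""
    if sorted(order) != list(range(len(node.children))):
        raise ValueError("order must be a permutation of the child indices")
    node.children = [node.children[i] for i in order]


def replace_glosses(node: Node, glosses: Dict[str, str]) -> None:
    """Replace each leaf token with its gloss; tokens without a gloss stay as-is."""
    if node.is_leaf():
        node.token = glosses.get(node.token, node.token)
    for child in node.children:
        replace_glosses(child, glosses)


def leaves(node: Node) -> List[str]:
    """Return the leaf tokens from left to right."""
    if node.is_leaf():
        return [node.token]
    return [tok for child in node.children for tok in leaves(child)]


def build_example() -> Node:
    """Toy tree for the English sentence 'I read the book' (hypothetical data)."""
    obj = Node("NP", children=[Node("DT", "the"), Node("NN", "book")])
    vp = Node("VP", children=[Node("VBD", "read"), obj])
    return Node("S", children=[Node("NP", children=[Node("PRP", "I")]), vp])


def translate(tree: Node) -> List[str]:
    """Apply the two constrained operations to the toy tree."""
    vp = tree.children[1]
    reorder(vp, [1, 0])  # Turkish is verb-final: move the object before the verb
    # Hypothetical glosses; "*NONE*" marks a token with no Turkish counterpart.
    replace_glosses(tree, {"I": "Ben", "read": "okudum",
                           "the": "*NONE*", "book": "kitabı"})
    return [tok for tok in leaves(tree) if tok != "*NONE*"]


if __name__ == "__main__":
    print(" ".join(translate(build_example())))
```

Because both operations keep the tree shape intact, each Turkish tree stays node-for-node aligned with its English source, which is what makes the pair usable as parallel training data.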
Pages: 184-199
Page count: 16