Generation of Compound Words in Statistical Machine Translation into Compounding Languages

Cited by: 5
Authors
Stymne, Sara [1]
Cancedda, Nicola [2]
Ahrenberg, Lars [3]
Affiliations
[1] Uppsala Univ, Dept Linguist & Philol, S-75126 Uppsala, Sweden
[2] Xerox Res Ctr Europe, F-38240 Meylan, France
[3] Linkoping Univ, Dept Comp & Informat Sci, S-58183 Linkoping, Sweden
DOI
10.1162/COLI_a_00162
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
In this article we investigate statistical machine translation (SMT) into Germanic languages, with a focus on compound processing. Our main goal is to enable the generation of novel compounds that have not been seen in the training data. We adopt a split-merge strategy, where compounds are split before training the SMT system, and merged after the translation step. This approach reduces sparsity in the training data, but runs the risk of placing translations of compound parts in non-consecutive positions. It also requires a postprocessing step of compound merging, where compounds are reconstructed in the translation output. We present a method for increasing the chances that components that should be merged are translated into contiguous positions and in the right order and show that it can lead to improvements both by direct inspection and in terms of standard translation evaluation metrics. We also propose several new methods for compound merging, based on heuristics and machine learning, which outperform previously suggested algorithms. These methods can produce novel compounds and a translation with at least the same overall quality as the baseline. For all subtasks we show that it is useful to include part-of-speech based information in the translation process, in order to handle compounds.
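The split-merge strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the `@@` marker, the dictionary-based binary splitter, and all function names are hypothetical choices made for illustration, and linking elements (such as the German "s") are ignored. It only shows how marking compound parts before training lets a simple merger rebuild compounds afterwards, including ones never seen whole.

```python
# Hypothetical marker appended to non-final compound parts (an assumption,
# not taken from the article).
PART_MARK = "@@"


def split_compound(word, known_words, min_len=3):
    """Split a word into two parts if both halves are known words.

    Returns [left + PART_MARK, right] for the first valid split point,
    or [word] unchanged if no split is found.
    """
    for i in range(min_len, len(word) - min_len + 1):
        left, right = word[:i], word[i:]
        if left in known_words and right in known_words:
            return [left + PART_MARK, right]
    return [word]


def split_sentence(tokens, known_words):
    """Apply compound splitting to every token (done before training)."""
    out = []
    for tok in tokens:
        out.extend(split_compound(tok, known_words))
    return out


def merge_sentence(tokens):
    """Merge each marked part with the token that follows it
    (done on the translation output)."""
    out = []
    pending = ""
    for tok in tokens:
        if tok.endswith(PART_MARK):
            pending += tok[: -len(PART_MARK)]
        else:
            out.append(pending + tok)
            pending = ""
    if pending:
        # A marked part with nothing after it is emitted as a plain word.
        out.append(pending)
    return out


known = {"hand", "schuh", "tasche"}
split_tokens = split_sentence(["handschuh"], known)
novel = merge_sentence(["hand@@", "tasche"])
```

Here `split_tokens` holds the two marked parts of "handschuh", and `novel` shows that parts placed in adjacent positions merge into "handtasche" even though that compound was never split from the input. The sketch also makes the risk named in the abstract concrete: if the translation step puts marked parts in non-consecutive positions or in the wrong order, this naive merger produces a wrong word.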
Pages: 1067-1108
Page count: 42