A High-Quality Multilingual Dataset for Structured Documentation Translation

Authors
Hashimoto, Kazuma [1]
Buschiazzo, Raffaella [1]
Bradbury, James [1, 2]
Marshall, Teresa [1]
Socher, Richard [1]
Xiong, Caiming [1]
Affiliations
[1] Salesforce, San Francisco, CA 94301 USA
[2] Google Brain, Mountain View, CA USA
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents a high-quality multilingual dataset for the documentation domain to advance research on localization of structured text. Unlike widely-used datasets for translation of plain text, we collect XML-structured parallel text segments from the online documentation for an enterprise software platform. These Web pages have been professionally translated from English into 16 languages and maintained by domain experts, and around 100,000 text segments are available for each language pair. We build and evaluate translation models for seven target languages from English, with several different copy mechanisms and an XML-constrained beam search. We also experiment with a non-English pair to show that our dataset has the potential to explicitly enable 17 x 16 translation settings. Our experiments show that learning to translate with the XML tags improves translation accuracy, and the beam search accurately generates XML structures. We also discuss trade-offs of using the copy mechanisms by focusing on translation of numerical words and named entities. We further provide a detailed human analysis of gaps between the model output and human translations for real-world applications, including suitability for post-editing.
Pages: 116-127
Number of pages: 12