A High-Quality Multilingual Dataset for Structured Documentation Translation

Cited by: 0
Authors
Hashimoto, Kazuma [1]
Buschiazzo, Raffaella [1]
Bradbury, James [1,2]
Marshall, Teresa [1]
Socher, Richard [1]
Xiong, Caiming [1]
Affiliations
[1] Salesforce, San Francisco, CA 94301 USA
[2] Google Brain, Mountain View, CA USA
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
This paper presents a high-quality multilingual dataset for the documentation domain to advance research on localization of structured text. Unlike widely used datasets for translation of plain text, we collect XML-structured parallel text segments from the online documentation for an enterprise software platform. These Web pages have been professionally translated from English into 16 languages and maintained by domain experts, and around 100,000 text segments are available for each language pair. We build and evaluate translation models for seven target languages from English, with several different copy mechanisms and an XML-constrained beam search. We also experiment with a non-English pair to show that our dataset has the potential to explicitly enable 17 × 16 translation settings. Our experiments show that learning to translate with the XML tags improves translation accuracy, and the beam search accurately generates XML structures. We also discuss trade-offs of using the copy mechanisms by focusing on translation of numerical words and named entities. We further provide a detailed human analysis of gaps between the model output and human translations for real-world applications, including suitability for post-editing.
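The abstract mentions an XML-constrained beam search but does not describe how it works. As a rough illustration of the general idea only, and not the authors' implementation, the sketch below filters candidate next tokens so that a partial output can still be completed into well-formed XML. The token format, the `[EOS]` marker, and the function names `is_allowed`, `advance`, and `filter_candidates` are all hypothetical, as is the rule that only tags present in the source segment may be opened.

```python
import re
from typing import List, Set

EOS = "[EOS]"  # hypothetical end-of-sequence marker
OPEN_TAG = re.compile(r"^<([A-Za-z][\w.-]*)>$")
CLOSE_TAG = re.compile(r"^</([A-Za-z][\w.-]*)>$")


def is_allowed(stack: List[str], token: str, source_tags: Set[str]) -> bool:
    """Return True if `token` keeps the output completable as well-formed XML."""
    if token == EOS:
        # The sequence may end only when every opened tag has been closed.
        return not stack
    opened = OPEN_TAG.match(token)
    if opened:
        # Assumption: only tags that occur in the source segment may be opened.
        return opened.group(1) in source_tags
    closed = CLOSE_TAG.match(token)
    if closed:
        # A closing tag must match the most recently opened tag.
        return bool(stack) and stack[-1] == closed.group(1)
    return True  # ordinary text tokens are always allowed


def advance(stack: List[str], token: str) -> List[str]:
    """Return the open-tag stack after emitting `token` (assumed allowed)."""
    opened = OPEN_TAG.match(token)
    if opened:
        return stack + [opened.group(1)]
    if CLOSE_TAG.match(token):
        return stack[:-1]
    return stack


def filter_candidates(prefix: List[str], candidates: List[str],
                      source_tags: Set[str]) -> List[str]:
    """Keep only the candidates that a constrained decoder could emit next."""
    stack: List[str] = []
    for tok in prefix:
        stack = advance(stack, tok)
    return [c for c in candidates if is_allowed(stack, c, source_tags)]


if __name__ == "__main__":
    prefix = ["Click", "<b>", "Save"]
    candidates = ["</b>", "</i>", "<i>", "now", EOS]
    print(filter_candidates(prefix, candidates, {"b"}))
```

In a real beam search, a filter of this kind would be applied to each hypothesis before scoring, so that disallowed tokens never enter the beam. How the paper handles tag attributes, tag counts, or self-closing tags is not stated in the abstract and is not modeled here.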
Pages: 116-127
Number of pages: 12