North Korean Neural Machine Translation through South Korean Resources

Cited: 0
Authors
Kim, Hwichan [1 ]
Hirasawa, Tosho [1]
Moon, Sangwhan [2 ]
Okazaki, Naoaki [2 ]
Komachi, Mamoru [1 ]
Affiliations
[1] Tokyo Metropolitan Univ, 6-6 Asahigaoka, Hino, Tokyo 1910065, Japan
[2] Tokyo Inst Technol, 2-12-1 Ookayama, Meguro, Tokyo 1528550, Japan
Keywords
Low resource; parallel data construction; pre-processing; North Korean machine translation;
DOI
10.1145/3608947
CLC number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
South and North Korea both use the Korean language. However, Korean natural language processing (NLP) research has mostly focused on the South Korean variety. Consequently, existing Korean NLP systems, such as neural machine translation (NMT) systems, cannot properly process North Korean inputs. Training a model on North Korean data is the most straightforward solution, but the data available to train NMT models are insufficient. To address this, we constructed a parallel corpus for developing a North Korean NMT model from a comparable corpus. We manually aligned parallel sentences to create evaluation data and automatically aligned the remaining sentences to create training data. We trained a North Korean NMT model using our North Korean parallel data and improved North Korean translation quality using South Korean resources such as parallel data and a pre-trained model. In addition, we propose Korean-specific pre-processing methods, character tokenization and phoneme decomposition, to use the South Korean resources more efficiently. We demonstrate that phoneme decomposition consistently improves North Korean translation accuracy compared to other pre-processing methods.
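The two pre-processing methods named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation; it assumes that character tokenization means splitting text into individual Hangul syllables, and that phoneme decomposition corresponds to splitting each precomposed syllable into its constituent jamo, which standard Unicode NFD normalization performs:

```python
import unicodedata

def char_tokenize(text: str) -> list[str]:
    """Character-level tokenization: one token per Hangul syllable."""
    return list(text)

def decompose_phonemes(text: str) -> list[str]:
    """Phoneme-level tokenization: NFD splits each precomposed
    Hangul syllable (U+AC00..U+D7A3) into conjoining jamo."""
    return list(unicodedata.normalize("NFD", text))

sentence = "조선말"  # a North Korean term for the Korean language
print(char_tokenize(sentence))       # 3 syllable tokens
print(decompose_phonemes(sentence))  # 8 jamo tokens (2 + 3 + 3)
```

Decomposing to jamo shrinks the symbol vocabulary and lets syllables that are spelled differently in the two varieties share sub-character units, which is the intuition behind applying it before training on mixed South/North Korean data.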
Pages: 22
Related Papers
50 records in total
  • [1] Ancient Korean Neural Machine Translation
    Park, Chanjun
    Lee, Chanhee
    Yang, Yeongwook
    Lim, Heuiseok
    [J]. IEEE ACCESS, 2020, 8 : 116617 - 116625
  • [2] Learning How to Translate North Korean through South Korean
    Kim, Hwichan
    Moon, Sangwhan
    Okazaki, Naoaki
    Komachi, Mamoru
    [J]. LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 6711 - 6718
  • [3] Priming Ancient Korean Neural Machine Translation
    Park, Chanjun
    Lee, Seolhwa
    Seo, Jaehyung
    Moon, Hyeonseok
    Eo, Sugyeong
    Lim, Heuiseok
    [J]. LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 22 - 28
  • [4] Korean Neural Machine Translation Using Hierarchical Word Structure
    Park, Jeonghyeok
    Zhao, Hai
    [J]. 2020 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP 2020), 2020, : 294 - 298
  • [6] South Korean Scholars Studying North Korean Movies
    Yoon, Jiwon
    [J]. ASIAN CINEMA, 2007, 18 (02) : 160 - 179
  • [7] Zero-shot North Korean to English Neural Machine Translation by Character Tokenization and Phoneme Decomposition
    Kim, Hwichan
    Hirasawa, Tosho
    Komachi, Mamoru
    [J]. 58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020): STUDENT RESEARCH WORKSHOP, 2020, : 72 - 78
  • [8] Korean-Vietnamese Neural Machine Translation System With Korean Morphological Analysis and Word Sense Disambiguation
    Quang-Phuoc Nguyen
    Vo, Anh-Dung
    Shin, Joon-Choul
    Phuoc Tran
    Ock, Cheol-Young
    [J]. IEEE ACCESS, 2019, 7 : 32602 - 32616
  • [9] Context-Aware Neural Machine Translation for Korean Honorific Expressions
    Hwang, Yongkeun
    Kim, Yanghoon
    Jung, Kyomin
    [J]. ELECTRONICS, 2021, 10 (13)
  • [10] Neural Machine Translation Strategies for Generating Honorific-style Korean
    Wang, Lijie
    Tu, Mei
    Zhai, Mengxia
    Wang, Huadong
    Liu, Song
    Kim, Sang Ha
    [J]. PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2019, : 450 - 455