WCC-JC 2.0: A Web-Crawled and Manually Aligned Parallel Corpus for Japanese-Chinese Neural Machine Translation

Cited by: 4
Authors
Zhang, Jinyi [1]
Tian, Ye [2]
Mao, Jiannan [3]
Han, Mei [4]
Wen, Feng [1]
Guo, Cong [1]
Gao, Zhonghui [1]
Matsumoto, Tadahiro [3]
Affiliations
[1] Shenyang Ligong Univ, Sch Informat Sci & Engn, Shenyang 110159, Peoples R China
[2] Zhuzhou CRRC Times Elect Co Ltd, Zhuzhou 412001, Peoples R China
[3] Gifu Univ, Fac Engn, Gifu 5011193, Japan
[4] Hunan Univ Technol, Sch Elect & Informat Engn, Zhuzhou 412007, Peoples R China
Keywords
Japanese-Chinese parallel corpus; neural machine translation; construction of the parallel corpus; manually aligned corpus
DOI
10.3390/electronics12051140
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Movie and TV subtitles are frequently used in natural language processing (NLP) applications, but few Japanese-Chinese bilingual corpora are available as datasets for training neural machine translation (NMT) models. In our previous study, we constructed a sizable Japanese-Chinese bilingual corpus by collecting subtitle text from websites that host movies and television series. The unsatisfactory translation performance of that initial corpus, the Web-Crawled Corpus of Japanese and Chinese (WCC-JC 1.0), was caused predominantly by its limited number of sentence pairs. To address this shortcoming, we thoroughly analyzed the issues in the construction of WCC-JC 1.0 and built the WCC-JC 2.0 corpus by first collecting subtitle data from movie and TV series websites and then manually aligning a large number of high-quality sentence pairs. The resulting corpus includes about 1.4 million sentence pairs, an 87% increase over WCC-JC 1.0, which places WCC-JC 2.0 among the largest publicly available Japanese-Chinese bilingual corpora in the world. To assess WCC-JC 2.0, we calculated BLEU scores in comparison with other corpora and manually evaluated the translations generated by models trained on WCC-JC 2.0. We provide WCC-JC 2.0 as a free download for research purposes only.
Pages: 15
Related papers
8 items in total
  • [1] WCC-JC: A Web-Crawled Corpus for Japanese-Chinese Neural Machine Translation
    Zhang, Jinyi
    Tian, Ye
    Mao, Jiannan
    Han, Mei
    Matsumoto, Tadahiro
    [J]. APPLIED SCIENCES-BASEL, 2022, 12 (12)
  • [2] WCC-EC 2.0: Enhancing Neural Machine Translation with a 1.6M+ Web-Crawled English-Chinese Parallel Corpus
    Zhang, Jinyi
    Su, Ke
    Tian, Ye
    Matsumoto, Tadahiro
    [J]. ELECTRONICS, 2024, 13 (07)
  • [3] Human evaluation of web-crawled parallel corpora for machine translation
    Ramirez-Sanchez, Gema
    Banon, Marta
    Zaragoza-Bernabeu, Jaume
    Ortiz-Rojas, Sergio
    [J]. PROCEEDINGS OF THE 2ND WORKSHOP ON HUMAN EVALUATION OF NLP SYSTEMS (HUMEVAL 2022), 2022, : 32 - 41
  • [4] Filtering of Noisy Web-Crawled Parallel Corpus: the Japanese-Bulgarian Language Pair
    Nikolova-Stoupak, Iglika
    Shimizu, Shuichiro
    Chu, Chenhui
    Kurohashi, Sadao
    [J]. PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE COMPUTATIONAL LINGUISTICS IN BULGARIA, CLIB 2022, 2022, : 39 - 48
  • [5] Corpus Augmentation for Neural Machine Translation with Chinese-Japanese Parallel Corpora
    Zhang, Jinyi
    Matsumoto, Tadahiro
    [J]. APPLIED SCIENCES-BASEL, 2019, 9 (10)
  • [6] Character Decomposition for Japanese-Chinese Character-Level Neural Machine Translation
    Zhang, Jinyi
    Matsumoto, Tadahiro
    [J]. PROCEEDINGS OF THE 2019 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2019, : 35 - 40
  • [7] Improving Character-level Japanese-Chinese Neural Machine Translation with Radicals as an Additional Input Feature
    Zhang, Jinyi
    Matsumoto, Tadahiro
    [J]. 2017 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP), 2017, : 172 - 175
  • [8] An Enhanced Method for Neural Machine Translation via Data Augmentation Based on the Self-Constructed English-Chinese Corpus, WCC-EC
    Zhang, Jinyi
    Guo, Cong
    Mao, Jiannan
    Guo, Chong
    Matsumoto, Tadahiro
    [J]. IEEE ACCESS, 2023, 11 : 112123 - 112132