UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation

Authors
Tian, Liang [1]
Wong, Derek F. [1]
Chao, Lidia S. [1]
Quaresma, Paulo [2, 3, 4]
Oliveira, Francisco [1]
Lu, Yi [1]
Li, Shuo [1]
Wang, Yiming [1]
Wang, Longyue [1]
Affiliations
[1] Univ Macau, Dept Comp & Informat Sci, Lab NLP2CT, Macau, Peoples R China
[2] Univ Macau, Dept Portuguese, Macau, Peoples R China
[3] Univ Evora, INESC ID L2F, Evora, Portugal
[4] Univ Evora, Dept Comp Sci, Evora, Portugal
Keywords
English-Chinese parallel corpus; statistical machine translation; different domains;
DOI
Not available
Chinese Library Classification number
H0 [Linguistics];
Discipline classification code
030303; 0501; 050102;
Abstract
A parallel corpus is a valuable resource for cross-language information retrieval and data-driven natural language processing systems, especially for Statistical Machine Translation (SMT). However, most existing parallel corpora for Chinese are restricted to in-house use, while others are domain-specific and limited in size. To a certain degree, this limits SMT research. This paper describes the acquisition of a large-scale, high-quality parallel corpus for English and Chinese. The corpus constructed in this paper contains about 15 million English-Chinese (E-C) parallel sentences, of which more than 2 million training sentences and 5,000 test sentences are made publicly available. Unlike previous work, the corpus is designed to cover eight different domains, some of which are further categorized into different topics. The corpus will be released to the research community and is available at the NLP2CT website.
Pages: 1837-1842
Number of pages: 6