Extracting Parallel Phrases from Comparable Corpora

Authors
Zhang, Jiexin [1]
Cao, Hailong [1]
Zhao, Tiejun [1]
Affiliations
[1] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin, Peoples R China
Funding
International Science and Technology Cooperation Program; National Natural Science Foundation of China
Keywords
Statistical Machine Translation; comparable corpus; Support Vector Machine; classification;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
State-of-the-art statistical machine translation (SMT) models are trained on parallel corpora. However, traditional SMT loses its power for language pairs with few bilingual resources. This paper proposes a novel method that treats phrase extraction as a classification task. We first automatically generate the training and test phrase pairs for the classifier. We then train a Support Vector Machine (SVM) classifier that determines whether a phrase pair is parallel or non-parallel. The proposed approach is evaluated on a Chinese-English translation task. Experimental results show that the classifier's precision on the test sets is above 75% and its accuracy is above 98%. The quality of the extracted data is also evaluated by measuring its impact on the performance of a state-of-the-art SMT system built with a small parallel corpus; the resulting system outperforms the baseline.
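The classification view of phrase extraction can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the features, the SVM kernel, or the toolkit, so the two features (`lexical_coverage`, `length_ratio`), the toy data, and all function names below are assumptions made for illustration. The classifier is a linear model trained with the hinge loss used by linear SVMs, with regularization omitted for brevity.

```python
# Toy sketch: a linear classifier trained with hinge loss (the loss used by
# linear SVMs). Regularization is omitted, so this is a simplification,
# not a full SVM and not the paper's implementation.

# Hypothetical features per candidate phrase pair:
#   (lexical_coverage, length_ratio), both scaled to [0, 1].
# Label +1 = parallel, -1 = non-parallel. All values are invented.
TRAIN = [
    ((0.9, 0.9), 1), ((0.8, 1.0), 1), ((1.0, 0.8), 1),
    ((0.1, 0.5), -1), ((0.2, 0.3), -1), ((0.0, 0.9), -1), ((0.3, 0.4), -1),
]


def score(w, b, x):
    """Signed distance-like score of feature vector x under (w, b)."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_hinge(data, lr=0.1, epochs=1000):
    """Subgradient descent on the hinge loss max(0, 1 - y * score)."""
    w, b = [0.0] * len(data[0][0]), 0.0
    for _ in range(epochs):
        updated = False
        for x, y in data:
            if y * score(w, b, x) < 1:  # pair violates the margin
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                updated = True
        if not updated:  # every training pair now has margin >= 1
            break
    return w, b


def predict(w, b, x):
    """Return +1 (parallel) or -1 (non-parallel)."""
    return 1 if score(w, b, x) >= 0 else -1


def precision_accuracy(tp, fp, fn, tn):
    """Precision and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return precision, accuracy


if __name__ == "__main__":
    w, b = train_hinge(TRAIN)
    for x, y in TRAIN:
        print(x, "gold:", y, "predicted:", predict(w, b, x))
    # Invented counts for a test set dominated by non-parallel pairs.
    print(precision_accuracy(tp=150, fp=50, fn=50, tn=9750))
```

The last line uses invented counts to show how precision and accuracy can diverge: with 150 true positives, 50 false positives, 50 false negatives, and 9,750 true negatives, precision is 0.75 while accuracy is 0.99. Whether the paper's test sets are imbalanced in this way is not stated in the abstract.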
Pages: 166-169
Page count: 4