Extracting Parallel Phrases from Comparable Corpora

Cited by: 0

Authors:

Zhang, Jiexin [1]

Cao, Hailong [1]

Zhao, Tiejun [1]

Affiliation:

[1] Harbin Inst Technol, Sch Comp Sci & Technol, Harbin, Peoples R China

Source:

PROCEEDINGS OF THE 2014 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP 2014) | 2014

Funding:

International Science and Technology Cooperation Program; National Natural Science Foundation of China;

Keywords:

Statistical Machine Translation; comparable corpus; Support Vector Machine; classification;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

State-of-the-art statistical machine translation (SMT) models are trained on parallel corpora. However, traditional SMT loses its power for language pairs with few bilingual resources. This paper proposes a novel method that treats phrase extraction as a classification task. We first automatically generate the training and test phrase pairs for the classifier. We then train an SVM classifier that determines whether a phrase pair is parallel or non-parallel. The proposed approach is evaluated on a Chinese-English translation task. Experimental results show that the classifier's precision on the test sets is above 75% and its accuracy is above 98%. The quality of the extracted data is also evaluated by measuring its impact on the performance of a state-of-the-art SMT system built with a small parallel corpus; the resulting system outperforms the baseline.
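
The abstract reports two different figures for the classifier, precision and accuracy. As a reminder of how the two are computed, the following is a minimal sketch using confusion-matrix counts. The counts, the `ConfusionCounts` class, and the function names are hypothetical and are not taken from the paper; the abstract does not state the class balance of the test sets. The sketch only shows that the two metrics can diverge, for instance when non-parallel pairs greatly outnumber parallel ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical outcome counts for a parallel / non-parallel classifier."""
    tp: int  # parallel pairs labeled parallel
    fp: int  # non-parallel pairs labeled parallel
    fn: int  # parallel pairs labeled non-parallel
    tn: int  # non-parallel pairs labeled non-parallel


def precision(c: ConfusionCounts) -> float:
    """Fraction of pairs labeled parallel that really are parallel."""
    predicted_positive = c.tp + c.fp
    return c.tp / predicted_positive if predicted_positive else 0.0


def accuracy(c: ConfusionCounts) -> float:
    """Fraction of all pairs that received the correct label."""
    total = c.tp + c.fp + c.fn + c.tn
    return (c.tp + c.tn) / total if total else 0.0


# Illustrative counts only (assumed, not from the paper):
# 200 parallel pairs among 10,000 candidates.
counts = ConfusionCounts(tp=150, fp=50, fn=50, tn=9750)

print(f"precision = {precision(counts):.2f}")
print(f"accuracy  = {accuracy(counts):.2f}")
```

With these assumed counts, the many correctly rejected non-parallel pairs dominate the accuracy, while precision depends only on the pairs the classifier accepts.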

Pages: 166-169

Page count: 4
