Extracting an English-Persian Parallel Corpus from Comparable Corpora

Cited by: 0
Authors
Karimi, Akbar [1 ]
Ansari, Ebrahim [1 ]
Bigham, Bahram Sadeghi [1 ]
Affiliation
[1] Inst Adv Studies Basic Sci IASBS, Dept Comp Sci & Informat Technol, Zanjan, Iran
Keywords
Parallel Sentence Extraction; Comparable Corpora; Statistical Machine Translation; Wikipedia; English-Persian Corpus;
DOI
Not available
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203 ; 0835 ;
Abstract
Parallel data are an important part of a reliable Statistical Machine Translation (SMT) system: the more such data are available, the better the quality of the SMT system. However, for some language pairs, such as Persian-English, parallel sources of this kind are scarce. In this paper, a bidirectional method is proposed to extract parallel sentences from document-aligned English and Persian Wikipedia. Two machine translation systems are employed to translate from Persian to English and from English to Persian, after which an information retrieval (IR) system is used to measure the similarity of the translated sentences. Adding the extracted sentences to the training data of existing SMT systems is shown to improve translation quality. Furthermore, the proposed method slightly outperforms the one-directional approach. The extracted corpus consists of about 200,000 sentences, sorted by their degree of similarity as calculated by the IR system, and is freely available for public access on the Web(1).
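The bidirectional idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: `fa_to_en` and `en_to_fa` are hypothetical callables standing in for the two machine translation systems, a bag-of-words cosine similarity stands in for the IR system, and averaging the two directions with a cutoff `threshold` is only one plausible way to combine the scores.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a simple stand-in for the IR system)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def bidirectional_pairs(en_sents, fa_sents, fa_to_en, en_to_fa, threshold=0.5):
    """Score all sentence pairs of one aligned English/Persian document pair.

    fa_to_en and en_to_fa are hypothetical translation callables; the
    averaging of both directions and the threshold are assumptions.
    """
    en_translated = [en_to_fa(s) for s in en_sents]  # English -> Persian
    fa_translated = [fa_to_en(s) for s in fa_sents]  # Persian -> English
    scored = []
    for i, en in enumerate(en_sents):
        for j, fa in enumerate(fa_sents):
            # Compare in English space and in Persian space, then average.
            score = 0.5 * (
                cosine(en, fa_translated[j]) + cosine(fa, en_translated[i])
            )
            if score >= threshold:
                scored.append((score, en, fa))
    # Highest-similarity pairs first, mirroring the sorted corpus.
    return sorted(scored, key=lambda item: item[0], reverse=True)
```

A one-directional variant would use only one of the two `cosine` terms; the sketch makes it easy to compare both settings on the same candidate pairs.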
Pages: 3477 - 3482
Number of pages: 6
Related papers
50 in total
  • [1] TEP: Tehran English-Persian Parallel Corpus
    Pilevar, Mohammad Taher
    Faili, Heshaam
    Pilevar, Abdol Hamid
    [J]. COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, PT II, 2011, 6609 : 68 - +
  • [2] Extracting Parallel Paragraphs and Sentences from English-Persian Translated Documents
    Rasooli, Mohammad Sadegh
    Kashefi, Omid
    Minaei-Bidgoli, Behrouz
    [J]. INFORMATION RETRIEVAL TECHNOLOGY, 2011, 7097 : 574 - 583
  • [3] TPC: An Automatically Generated Comprehensive English-Persian Parallel Corpus
    Farzi, Saeed
    Faili, Heshaam
    [J]. 2017 5TH INTERNATIONAL SYMPOSIUM ON COMPUTATIONAL AND BUSINESS INTELLIGENCE (ISCBI), 2017, : 91 - 95
  • [4] Constructing a Large-Scale English-Persian Parallel Corpus
    Miangah, Tayebeh Mosavi
    [J]. META, 2009, 54 (01) : 181 - 188
  • [5] Extracting Parallel Phrases from Comparable Corpora
    Zhang, Jiexin
    Cao, Hailong
    Zhao, Tiejun
    [J]. PROCEEDINGS OF THE 2014 INTERNATIONAL CONFERENCE ON ASIAN LANGUAGE PROCESSING (IALP 2014), 2014, : 166 - 169
  • [6] Extracting Directional and Comparable Corpora from a Multilingual Corpus for Translation Studies
    Cartoni, Bruno
    Meyer, Thomas
    [J]. LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2012, : 2132 - 2137
  • [7] Matching Graph, a Method for Extracting Parallel Information from Comparable Corpora
    Bakhshaei, Somayeh
    Safabakhsh, Reza
    Khadivi, Shahram
    [J]. ACM TRANSACTIONS ON ASIAN AND LOW-RESOURCE LANGUAGE INFORMATION PROCESSING, 2020, 19 (01)
  • [8] Creating a Persian-English Comparable Corpus
    Hashemi, Homa Baradaran
    Shakery, Azadeh
    Faili, Heshaam
    [J]. MULTILINGUAL AND MULTIMODAL INFORMATION ACCESS EVALUATION, 2010, 6360 : 27 - 39
  • [9] Building English - Punjabi Aligned Parallel Corpora of Nouns from Comparable Corpora
    Kaur, Dilshad
    Singh, Satwinder
    [J]. APPLIED COMPUTER SYSTEMS, 2023, 28 (02) : 245 - 251
  • [10] Expand-Extract: A Parallel Corpus Mining Framework from Comparable Corpora for English Myanmar Machine Translation
    Zin, May Myo
    Racharak, Teeradaj
    Nguyen, Minh Le
    [J]. 2022 IEEE 34TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, ICTAI, 2022, : 259 - 266