Extracting translations from comparable corpora for Cross-Language Information Retrieval using the language modeling framework

Cited by: 19
Authors
Rahimi, Razieh [1 ]
Shakery, Azadeh [1 ,2 ]
King, Irwin [3 ]
Affiliations
[1] Univ Tehran, Coll Engn, Sch Elect & Comp Engn, POB 14395-515, Tehran, Iran
[2] Inst Res Fundamental Sci IPM, Sch Comp Sci, POB 19395-5746, Tehran, Iran
[3] Chinese Univ Hong Kong, Dept Comp Sci & Engn, Shatin, Hong Kong, Peoples R China
Keywords
Translation model; Bilingual lexicon; Comparable corpora; Cross-Language Information Retrieval; Language modeling framework; CORPUS;
DOI
10.1016/j.ipm.2015.08.001
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
A main challenge in Cross-Language Information Retrieval (CLIR) is to estimate a proper translation model from available translation resources, since translation quality directly affects retrieval performance. Among different translation resources, we focus on obtaining translation models from comparable corpora, because they provide appropriate translations for both languages and domains with limited linguistic resources. In this paper, we employ a two-step approach to build an effective translation model from comparable corpora, without requiring any additional linguistic resources, for the CLIR task. In the first step, translations are extracted by deriving correlations between source-target word pairs. These correlations are used to estimate word translation probabilities in the second step. We propose a language modeling approach for the first step, where modeling based on probability distributions provides two key advantages. First, our approach can be tuned more easily than heuristically adjusted previous work. Second, it provides a principled basis for integrating additional lexical and translational relations to improve the accuracy of translations from comparable corpora. As an illustration, we integrate monolingual relations of word co-occurrences into the process of translation extraction, which helps to extract more reliable translations for low-frequency words in a comparable corpus. Experimental results on an English-Persian comparable corpus show that our method outperforms the previous approaches in terms of both translation quality and the performance of CLIR. Indeed, the proposed method is naturally applicable to any comparable corpus, regardless of its languages. In addition, we demonstrate the significant impact of word translation probabilities, estimated in the second step of our approach, on the performance of CLIR. © 2015 Elsevier Ltd. All rights reserved.
Pages: 299-318
Number of pages: 20
Related papers
50 in total
  • [21] A study on automatic creation of a comparable document collection in cross-language information retrieval
    Talvensaari, Tuomas
    Laurikkala, Jorma
    Jarvelin, Kalervo
    Juhola, Martti
    JOURNAL OF DOCUMENTATION, 2006, 62 (03) : 372 - 387
  • [22] Cross-Language Information Retrieval with Latent Topic Models Trained on a Comparable Corpus
    Vulic, Ivan
    De Smet, Wim
    Moens, Marie-Francine
    INFORMATION RETRIEVAL TECHNOLOGY, 2011, 7097 : 37 - 48
  • [23] Mining a Persian-English comparable corpus for cross-language information retrieval
    Hashemi, Homa B.
    Shakery, Azadeh
    INFORMATION PROCESSING & MANAGEMENT, 2014, 50 (02) : 384 - 398
  • [24] Exploiting Comparable Corpora and Bilingual Dictionaries for Cross-Language Text Categorization
    Gliozzo, Alfio
    Strapparava, Carlo
    COLING/ACL 2006, VOLS 1 AND 2, PROCEEDINGS OF THE CONFERENCE, 2006, : 553 - 560
  • [25] Patapsco: A Python Framework for Cross-Language Information Retrieval Experiments
    Costello, Cash
    Yang, Eugene
    Lawrie, Dawn
    Mayfield, James
    ADVANCES IN INFORMATION RETRIEVAL, PT II, 2022, 13186 : 276 - 280
  • [26] A Novel Method for Cross-Language Retrieval of Chunks Using Monolingual and Bilingual Corpora
    Miangah, Tayebeh Mosavi
    Nezarat, Amin
    INFORMATION AND COMMUNICATION TECHNOLOGIES, 2010, 101 : 307 - +
  • [27] Multilingual information access system using cross-language information retrieval
    Hayashi, Yoshihiko
    Matsuo, Yoshihiro
    Nagata, Masaaki
    Furuse, Osamu
    2003, Nippon Telegraph and Telephone Corp. (52):
  • [28] Easing erroneous translations in cross-language image retrieval using word associations
    Inoue, Masashi
    ACCESSING MULTILINGUAL INFORMATION REPOSITORIES, 2006, 4022 : 582 - 591
  • [29] Towards Web mining of query translations for cross-language information retrieval in digital libraries
    Lu, WH
    Wang, JH
    Chien, LF
    DIGITAL LIBRARIES: TECHNOLOGY AND MANAGEMENT OF INDIGENOUS KNOWLEDGE FOR GLOBAL ACCESS, 2003, 2911 : 86 - 99
  • [30] Neural Methods for Cross-Language Information Retrieval
    Yang, Eugene
    Lawrie, Dawn
    Mayfield, James
    Nair, Suraj
    Oard, Douglas W.
    PROCEEDINGS OF THE 46TH INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, SIGIR 2023, 2023, : 3430 - 3431