Extracting translations from comparable corpora for Cross-Language Information Retrieval using the language modeling framework

Cited by: 19

Authors:

Rahimi, Razieh [1]

Shakery, Azadeh [1,2]

King, Irwin [3]

Affiliations:

[1] Univ Tehran, Coll Engn, Sch Elect & Comp Engn, POB 14395-515, Tehran, Iran

[2] Inst Res Fundamental Sci IPM, Sch Comp Sci, POB 19395-5746, Tehran, Iran

[3] Chinese Univ Hong Kong, Dept Comp Sci & Engn, Shatin, Hong Kong, Peoples R China

Source:

INFORMATION PROCESSING & MANAGEMENT | 2016 / Vol. 52 / Issue 02

Keywords:

Translation model; Bilingual lexicon; Comparable corpora; Cross-Language Information Retrieval; Language modeling framework; CORPUS;

DOI:

10.1016/j.ipm.2015.08.001

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

A main challenge in Cross-Language Information Retrieval (CLIR) is to estimate a proper translation model from available translation resources, since translation quality directly affects retrieval performance. Among different translation resources, we focus on obtaining translation models from comparable corpora, because they provide appropriate translations for both languages and domains with limited linguistic resources. In this paper, we employ a two-step approach to build an effective translation model from comparable corpora for the CLIR task, without requiring any additional linguistic resources. In the first step, translations are extracted by deriving correlations between source-target word pairs. In the second step, these correlations are used to estimate word translation probabilities. We propose a language modeling approach for the first step, where modeling based on probability distributions provides two key advantages. First, our approach can be tuned more easily than previous work, which relies on heuristic adjustment. Second, it provides a principled basis for integrating additional lexical and translational relations to improve the accuracy of translations from comparable corpora. As an illustration, we integrate monolingual word co-occurrence relations into the translation extraction process, which helps to extract more reliable translations for low-frequency words in a comparable corpus. Experimental results on an English-Persian comparable corpus show that our method outperforms previous approaches in terms of both translation quality and CLIR performance. Moreover, the proposed method is naturally applicable to any comparable corpus, regardless of its languages. In addition, we demonstrate the significant impact of word translation probabilities, estimated in the second step of our approach, on the performance of CLIR. (C) 2015 Elsevier Ltd. All rights reserved.
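The two-step idea in the abstract can be illustrated with a minimal Python sketch. This is not the authors' model: the abstract does not give the scoring formula, so the correlation score used here (a sum of products of maximum-likelihood document language models over aligned document pairs) is an assumption chosen for illustration, as are the function names and the toy transliterated tokens.

```python
from collections import Counter, defaultdict


def doc_lm(tokens):
    """Maximum-likelihood unigram language model of one document."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def correlations(aligned_pairs):
    """Step 1 (assumed scoring): correlate each source-target word pair by
    summing p(s | source doc) * p(t | target doc) over aligned documents."""
    scores = defaultdict(float)
    for src_tokens, tgt_tokens in aligned_pairs:
        p_src = doc_lm(src_tokens)
        p_tgt = doc_lm(tgt_tokens)
        for s, ps in p_src.items():
            for t, pt in p_tgt.items():
                scores[(s, t)] += ps * pt
    return scores


def translation_probs(scores):
    """Step 2 (assumed normalization): turn correlation scores into
    p(t | s) by normalizing over all targets of each source word."""
    totals = defaultdict(float)
    for (s, _), v in scores.items():
        totals[s] += v
    table = defaultdict(dict)
    for (s, t), v in scores.items():
        table[s][t] = v / totals[s]
    return dict(table)


# Toy aligned comparable documents (hypothetical tokens, not real data).
pairs = [
    (["oil", "price"], ["naft", "gheymat"]),
    (["oil", "export"], ["naft", "saderat"]),
]
table = translation_probs(correlations(pairs))
best = max(table["oil"], key=table["oil"].get)
print(best, table["oil"][best])
```

In this toy corpus "naft" co-occurs with "oil" in both aligned pairs, so it receives the largest share of the probability mass for "oil". Keeping the two steps as separate functions mirrors the abstract's separation between extracting correlations and estimating translation probabilities, so either step could be swapped out independently.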


Pages: 299-318

Number of pages: 20
