Exploiting Wikipedia for cross-lingual and multilingual information retrieval

Cited by: 53
Authors
Sorg, P. [1]
Cimiano, P. [2]
Affiliations
[1] KIT, Inst AIFB, D-76128 Karlsruhe, Germany
[2] Univ Bielefeld, CITEC, Semant Comp Grp, D-33615 Bielefeld, Germany
Keywords
Cross-Lingual Information Retrieval; Concept-based Information Retrieval; Social Web; Wikipedia
DOI
10.1016/j.datak.2012.02.003
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this article we show how Wikipedia, as a multilingual knowledge resource, can be exploited for Cross-Language and Multilingual Information Retrieval (CLIR/MLIR). We describe an approach we call Cross-Language Explicit Semantic Analysis (CL-ESA), which indexes documents with respect to explicit interlingual concepts. These concepts are considered interlingual and universal, and in our case correspond either to Wikipedia articles or to categories. Each concept is associated with a text signature in each language, which can be used to estimate language-specific term distributions for each concept. This knowledge can then be used to calculate the strength of association between a term and a concept, which in turn is used to map documents into the concept space. With CL-ESA we are thus moving from a Bag-Of-Words model to a Bag-Of-Concepts model that allows language-independent document representations in the vector space spanned by interlingual and universal concepts. We show how different vector-based retrieval models and term weighting strategies can be used in conjunction with CL-ESA and experimentally analyze the performance of the different choices. We evaluate the approach on a mate retrieval task on two datasets: JRC-Acquis and Multext. We show that in the MLIR settings, CL-ESA benefits from a certain level of abstraction, in the sense that using categories instead of articles (as in the original ESA model) delivers better results. (C) 2012 Elsevier B.V. All rights reserved.
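The mapping from terms to interlingual concepts described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the concept names, the toy text signatures, the TF-IDF-style weighting, and all function names are assumptions made up for this illustration. The abstract only states that different term weighting strategies and vector-based retrieval models are compared, without fixing one.

```python
import math
from collections import Counter

# Hypothetical toy data: one short "text signature" per concept and language.
CONCEPTS = ["Music", "Law"]
SIGNATURES = {
    "en": {"Music": "music song guitar concert song",
           "Law": "law court judge treaty"},
    "de": {"Music": "musik lied gitarre konzert lied",
           "Law": "gesetz gericht richter vertrag"},
}

def term_concept_weights(signatures):
    """Estimate a TF-IDF-style association strength per (concept, term).

    This weighting is one assumed choice among many possible ones.
    """
    n = len(signatures)
    tf = {c: Counter(text.split()) for c, text in signatures.items()}
    df = Counter(t for counts in tf.values() for t in counts)
    return {c: {t: f * math.log(1 + n / df[t]) for t, f in counts.items()}
            for c, counts in tf.items()}

# Language-specific weights, all indexed by the same interlingual concepts.
WEIGHTS = {lang: term_concept_weights(sig) for lang, sig in SIGNATURES.items()}

def to_concept_vector(doc, lang):
    """Map a bag of words onto a bag of concepts (one value per concept)."""
    terms = Counter(doc.split())
    return [sum(tf * WEIGHTS[lang][c].get(t, 0.0) for t, tf in terms.items())
            for c in CONCEPTS]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def find_mate(doc, lang, candidates, cand_lang):
    """Return the index of the candidate closest to doc in concept space."""
    q = to_concept_vector(doc, lang)
    scores = [cosine(q, to_concept_vector(c, cand_lang)) for c in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)

german_docs = ["das lied und die gitarre", "der richter und das gericht"]
mate_index = find_mate("the judge read the treaty", "en", german_docs, "de")
```

Because both languages are projected onto the same concept axes, the English query and the German candidates become directly comparable without translating either one. In this toy setup the query activates only the hypothetical "Law" concept, so the second German document is selected as its mate.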
Pages: 26-45
Number of pages: 20