BAYESIAN RETRIEVAL USING A SIMILARITY-BASED LEMMATIZER

Authors
Maragoudakis, Manolis [1]
Lyras, Dimitrios P. [2]
Sgarbas, Kyriakos [2]
Affiliations
[1] Univ Aegean, Dept Informat & Commun Syst Engn, Samos, Greece
[2] Univ Patras, Dept Elect & Comp Engn, Wire Commun Lab, Artificial Intelligence Grp, GR-26500 Patras, Greece
Keywords
Bayesian networks; modern Greek; AhR; Ad-hoc retrieval; lemmatization; AUTOMATIC LEMMATIZATION; INFORMATION-RETRIEVAL; MODERN GREEK; MODEL;
DOI
10.1142/S0218213012500248
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
The present paper describes a Bayesian network approach to Information Retrieval (IR) from Web documents. The network structure provides an intuitive representation of uncertainty relationships, and the embedded conditional probability table is used by inference algorithms to identify documents that are relevant to the user's needs, expressed in the form of Boolean queries. Our research has been directed toward constructing a probabilistic IR framework that focuses on assisting users in performing ad-hoc retrieval of documents from various domains such as economics, news, and sports. Furthermore, users can provide feedback on the relevance of the retrieved documents to improve performance on subsequent requests. Toward these goals, we have extended the traditional Bayesian network IR system and tested it on several Greek web corpora from different application domains. We have developed two approaches with regard to the network structure: a simple one, where the structure is provided manually, and an automated one, where data mining is used to extract the network's structure. Results show competitive performance, in terms of precision-recall curves, against successful IR models of different theoretical backgrounds, such as the vector space model with tf-idf weighting and the probabilistic BM25 model. To further improve the performance of the IR system, we have implemented a novel similarity-based lemmatization framework, thus reducing the ambiguity posed by the plethora of morphological variations of the languages in question. The lemmatization framework comprises three core components (the word segregation, data cleansing, and lemmatization modules) and is language-independent (i.e., it can be applied to other languages with morphological peculiarities and thus improve ad-hoc retrieval), since it maps an input word to its normalized form by employing two state-of-the-art language-independent distance metrics, namely the Levenshtein edit distance and the Dice coefficient similarity measure, combined with a language model describing the most frequent inflectional suffixes of the examined language. Experimental results support our claim regarding the significance of this incorporation for web retrieval of Greek texts, as results improve by 4% to 11%.
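The abstract names the ingredients of the lemmatizer (Levenshtein edit distance, the Dice coefficient, and a list of frequent inflectional suffixes) but not how they are combined. The following is a minimal sketch of one plausible combination, not the authors' implementation. The function names, the equal 0.5/0.5 weighting, the character-bigram form of the Dice coefficient, and the transliterated toy lexicon and suffix list are all assumptions made for illustration, and only the lemmatization module is sketched (word segregation and data cleansing are omitted).

```python
from typing import Iterable, List, Set


def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def bigrams(word: str) -> Set[str]:
    """Set of adjacent character pairs in the word."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient over character-bigram sets: 2|A & B| / (|A| + |B|)."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:  # both words shorter than two characters
        return 1.0 if a == b else 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def strip_suffix(word: str, suffixes: Iterable[str]) -> str:
    """Remove the longest matching suffix, keeping a stem of at least 2 chars."""
    for s in sorted(suffixes, key=len, reverse=True):
        if word.endswith(s) and len(word) - len(s) >= 2:
            return word[:-len(s)]
    return word


def lemmatize(word: str, lexicon: List[str], suffixes: Iterable[str]) -> str:
    """Return the lexicon entry whose stem is most similar to the input's stem."""
    suffixes = list(suffixes)
    stem = strip_suffix(word, suffixes)

    def score(lemma: str) -> float:
        cand = strip_suffix(lemma, suffixes)
        longest = max(len(stem), len(cand)) or 1
        edit_sim = 1 - levenshtein(stem, cand) / longest
        # Equal weighting is an assumption, not taken from the paper.
        return 0.5 * edit_sim + 0.5 * dice(stem, cand)

    return max(lexicon, key=score)


# Hypothetical toy data: Latin-transliterated Greek nouns and endings.
LEXICON = ["anthropos", "dromos", "thalassa"]
SUFFIXES = ["os", "ou", "on", "ous", "a", "as", "es"]

if __name__ == "__main__":
    for form in ["dromou", "anthropon", "thalassas"]:
        print(form, "->", lemmatize(form, LEXICON, SUFFIXES))
```

Stripping the suffix before comparison means inflected forms of the same lemma usually reduce to the same stem, so both similarity measures reach their maximum for the correct entry. A real system would need a full lexicon and a suffix list derived from corpus statistics for the target language.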
Pages: 32