Evaluation of N-Gram Conflation Approaches for Arabic Text Retrieval

被引:18
|
作者
Ahmed, Farag [1 ]
Nuernberger, Andreas [1 ]
机构
[1] Otto VonGuericke Univ Magdegurg, Data & Knowledge Engn Grp, Dept Tech & Operat Informat Syst, Fac Comp Sci, Magdeburg, Germany
关键词
NATURAL-LANGUAGE; SEARCH;
D O I
10.1002/asi.21063
中图分类号
TP [自动化技术、计算机技术];
学科分类号
0812 ;
摘要
In this paper we present a language-independent approach for conflation that does not depend on predefined rules or prior knowledge of the target language. The proposed unsupervised method is based on an enhancement of the pure n-gram model that can group related words based on various string-similarity measures, while restricting the search to specific locations of the target word by taking into account the order of n-grams. We show that the method is effective to achieve high score similarities for all word-form variations and reduces the ambiguity, i.e., obtains a higher precision and recall, compared to pure n-gram-based approaches for English, Portuguese, and Arabic. The proposed method is especially suited for conflation approaches in Arabic, since Arabic is a highly inflectional language. Therefore, we present in addition an adaptive user interface for Arabic text retrieval called "araSearch". araSearch serves as a metasearch interface to existing search engines. The system is able to extend a query using the proposed conflation approach such that additional results for relevant subwords can be found automatically.
引用
收藏
页码:1448 / 1465
页数:18
相关论文
共 50 条
  • [1] Evaluation of N-gram term conflation approach for arabic texts
    Abu-Salem, H
    [J]. PROCEEDINGS OF 2003 INTERNATIONAL CONFERENCE ON MANAGEMENT SCIENCE & ENGINEERING, VOLS I AND II, 2003, : 2561 - 2567
  • [2] N-gram and local context analysis for Persian text retrieval
    Aleahmad, Abolfazl
    Hakimian, Parsia
    Mahdikhani, Farzad
    Oroumchian, Farhad
    [J]. 2007 9TH INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND ITS APPLICATIONS, VOLS 1-3, 2007, : 284 - 287
  • [3] Character N-Gram Tokenization for European Language Text Retrieval
    Paul McNamee
    James Mayfield
    [J]. Information Retrieval, 2004, 7 : 73 - 97
  • [4] Character N-gram tokenization for European language text retrieval
    McNamee, P
    Mayfield, J
    [J]. INFORMATION RETRIEVAL, 2004, 7 (1-2): : 73 - 97
  • [5] Improving arabic information retrieval system using n-gram method
    Legal Informatics center, Lebanese University, Sami Solh Street-Bp5396/116, Lebanon
    不详
    不详
    [J]. WSEAS Trans. Comput, 4 (125-133):
  • [6] Character-Based N-gram Model for Uyghur Text Retrieval
    Tohti, Turdi
    Xu, Lirui
    Huang, Jimmy
    Musajan, Winira
    Hamdulla, Askar
    [J]. BIOMETRIC RECOGNITION, CCBR 2018, 2018, 10996 : 678 - 688
  • [7] Google N-Gram Viewer does not Include Arabic Corpus! Towards N-Gram Viewer for Arabic Corpus
    Alsmadi, Izzat
    Zarour, Mohammad
    [J]. INTERNATIONAL ARAB JOURNAL OF INFORMATION TECHNOLOGY, 2018, 15 (05) : 785 - 794
  • [8] An Evaluation of Character Level N-gram Termsets in Text Categorization
    Coban, Onder
    Ozel, Selma Ayse
    [J]. 2018 INTERNATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND DATA PROCESSING (IDAP), 2018,
  • [9] Text mining with n-gram variables
    Schonlau, Matthias
    Guenther, Nick
    Sucholutsky, Ilia
    [J]. STATA JOURNAL, 2017, 17 (04): : 866 - 881
  • [10] SEARCHING FOR TEXT - SEND AN N-GRAM
    KIMBRELL, RE
    [J]. BYTE, 1988, 13 (05): : 297 - &