Combining N-Grams and Stemming for Arabic Word-Based Inexact Matching and Term Conflation

Cited by: 3
Author
Mustafa, Suleiman H. [1]
Affiliation
[1] Yarmouk Univ, Dept Comp Informat Syst, Irbid, Jordan
Keywords
N-grams; Arabic string matching; text searching; stemming; information retrieval; word conflation
DOI
10.1142/S0219649205000992
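Chinese Library Classification number
G25 [Library science, librarianship]; G35 [Information science, information work]
Discipline classification code
1205; 120501
Abstract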
This paper reports the results of three N-gram techniques. Two of them combine N-grams with stemming: the first uses first-order stemming and the second uses light stemming. The performance of the combined approaches was compared with that of conventional, pure N-gram-based string matching. The results provide good evidence that combining N-grams with stemming improves overall performance, as measured by word-match recall and word-match precision across different similarity threshold values.
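To make the idea in the abstract concrete, the following is a minimal sketch of N-gram matching with an optional stemming step applied first. It is not the paper's implementation: the abstract does not name the similarity measure or the stemming rules, so the Dice coefficient over character bigrams, the tiny affix lists, the 0.6 threshold, and the names ngrams, dice, light_stem and matches are all assumptions made for illustration.

```python
def ngrams(word, n=2):
    """Return the set of contiguous character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice(a, b, n=2):
    """Dice coefficient over the two words' n-gram sets (assumed measure)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga or not gb:
        return 0.0
    return 2 * len(ga & gb) / (len(ga) + len(gb))


# Hypothetical, deliberately tiny affix lists; a real light stemmer
# would use a much larger, carefully ordered inventory.
PREFIXES = ("ال", "و")
SUFFIXES = ("ات", "ون", "ين", "ة")


def light_stem(word):
    """Strip at most one listed prefix and one listed suffix,
    keeping at least three characters of the word."""
    for p in PREFIXES:
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word


def matches(query, candidate, threshold=0.6, stem=True):
    """Decide whether two words match at the given similarity threshold,
    optionally stemming both words before the n-gram comparison."""
    if stem:
        query, candidate = light_stem(query), light_stem(candidate)
    return dice(query, candidate) >= threshold


if __name__ == "__main__":
    q, c = "المعلمون", "معلمين"
    print(matches(q, c, stem=False))  # pure N-gram comparison
    print(matches(q, c, stem=True))   # stemming first, then N-grams
```

With these assumed settings, the two inflected forms in the usage example share only three of their bigrams and fall below the threshold when compared directly, but reduce to the same stem and match once the affixes are stripped. Varying the threshold argument is one way to explore the recall and precision trade-off the abstract refers to.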
Pages: 29-36
Page count: 8
Related papers
16 in total
  • [1] Corpus-Based Arabic Stemming Using N-Grams
    Zitouni, Abdelaziz
    Damankesh, Asma
    Barakati, Foroogh
    Atari, Maha
    Watfa, Mohamed
    Oroumchian, Farhad
    [J]. INFORMATION RETRIEVAL TECHNOLOGY, 2010, 6458 : 280 - 289
  • [2] Using Word N-Grams as Features in Arabic Text Classification
    Al-Thubaity, Abdulmohsen
    Alhoshan, Muneera
    Hazzaa, Itisam
    [J]. SOFTWARE ENGINEERING, ARTIFICIAL INTELLIGENCE, NETWORKING AND PARALLEL/DISTRIBUTED COMPUTING, 2015, 569 : 35 - 43
  • [3] Combining Word and Character N-grams for Detecting Deceptive Opinions
    Siagian, Al Hafiz Akbar Maulana
    Aritsugi, Masayoshi
    [J]. 2017 IEEE 41ST ANNUAL COMPUTER SOFTWARE AND APPLICATIONS CONFERENCE (COMPSAC), VOL 1, 2017 : 828 - 833
  • [4] Evaluation of N-grams conflation approach in text-based information retrieval
    Kosinov, S
    [J]. EIGHTH SYMPOSIUM ON STRING PROCESSING AND INFORMATION RETRIEVAL, PROCEEDINGS, 2001 : 136 - 142
  • [5] Diacritics restoration based on word n-grams for Slovak texts
    Toth, Stefan
    Zaymus, Emanuel
    Duracik, Michal
    Hrkut, Patrik
    Mesko, Matej
    [J]. OPEN COMPUTER SCIENCE, 2021, 11 (01) : 180 - 189
  • [6] Improvement of Imperfect String Matching Based on Asymmetric n-Grams
    Szymanski, Julian
    Boinski, Tomasz
    [J]. COMPUTATIONAL COLLECTIVE INTELLIGENCE: TECHNOLOGIES AND APPLICATIONS, 2013, 8083 : 306 - 315
  • [7] A Probabilistic Model Based on n-Grams for Bilingual Word Sense Disambiguation
    Vilarino, Darnes
    Pinto, David
    Tovar, Mireya
    Balderas, Carlos
    Beltran, Beatriz
    [J]. ADVANCES IN ARTIFICIAL INTELLIGENCE, MICAI 2010, PT I, 2010, 6437 : 82 - 91
  • [8] Automatic restoration of diacritics based on word n-grams for Slovak texts
    Toth, Stefan
    Zaymus, Emanuel
    Duracik, Michal
    Mesko, Matej
    Hrkut, Patrik
    [J]. 2019 IEEE 15TH INTERNATIONAL SCIENTIFIC CONFERENCE ON INFORMATICS (INFORMATICS 2019), 2019 : 243 - 248
  • [9] AUTOMATIC RECOGNITION OF COMMON ARABIC HANDWRITTEN WORDS BASED ON OCR AND N-GRAMS
    Dinges, Laslo
    Al-Hamadi, Ayoub
    Elzobi, Moftah
    Nuernberger, Andreas
    [J]. 2017 24TH IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2017 : 3625 - 3629
  • [10] Automatic word spacing using probabilistic models based on character n-grams
    Lee, Do-Gil
    Rim, Hae-Chang
    Yook, Dongsuk
    [J]. IEEE INTELLIGENT SYSTEMS, 2007, 22 (01) : 28 - 35