Character N-Gram Tokenization for European Language Text Retrieval

Cited by: 0
Authors
Paul McNamee
James Mayfield
Affiliation
[1] Johns Hopkins University, Applied Physics Laboratory
Source
Information Retrieval | 2004 / Vol. 7
Keywords
cross-language information retrieval; language-neutral retrieval; character n-grams; Cross Language Evaluation Forum; European languages
DOI
Not available
Chinese Library Classification number
Subject classification number
Abstract
The Cross-Language Evaluation Forum has encouraged research in text retrieval methods for numerous European languages and has developed durable test suites that allow language-specific techniques to be investigated and compared. The labor associated with crafting a retrieval system that takes advantage of sophisticated linguistic methods is daunting. We examine whether language-neutral methods can achieve accuracy comparable to language-specific methods with less concomitant software complexity. Using the CLEF 2002 test set we demonstrate empirically how overlapping character n-gram tokenization can provide retrieval accuracy that rivals the best current language-specific approaches for European languages. We show that n = 4 is a good choice for those languages, and document the increased storage and time requirements of the technique. We report on the benefits of and challenges posed by n-grams, and explain peculiarities attendant to bilingual retrieval. Our findings demonstrate clearly that accuracy using n-gram indexing rivals or exceeds accuracy using unnormalized words, for both monolingual and bilingual retrieval.
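As a rough illustration of the overlapping character n-gram tokenization the abstract refers to, the following minimal sketch slides a window of n characters across the text one position at a time. The function name `char_ngrams`, the lowercasing, the whitespace collapsing, and the choice to let n-grams span word boundaries are assumptions made for this example; the abstract does not specify the authors' normalization steps.

```python
def char_ngrams(text, n=4):
    """Return the overlapping character n-grams of `text`, in order.

    Hypothetical helper for illustration only. Assumptions (not taken
    from the abstract): text is lowercased, runs of whitespace are
    collapsed to a single space, and n-grams may span word boundaries.
    """
    # Normalize case and whitespace so equivalent strings tokenize alike.
    normalized = " ".join(text.lower().split())
    if not normalized:
        return []
    # Text shorter than n is kept as a single token rather than dropped.
    if len(normalized) < n:
        return [normalized]
    # Slide a window of width n across the text, one character at a time.
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


if __name__ == "__main__":
    # n = 4 is the value the abstract reports as a good choice.
    print(char_ngrams("Juggling"))
```

A string of length m yields m - n + 1 tokens, so an index built this way holds many more postings than a word index, which is in line with the increased storage and time requirements the abstract mentions.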
Pages: 73-97
Page count: 24
Related papers
50 items in total
  • [31] Perplexity of n-Gram and Dependency Language Models
    Popel, Martin
    Marecek, David
    [J]. TEXT, SPEECH AND DIALOGUE, 2010, 6231 : 173 - 180
  • [32] A variant of n-gram based language classification
    Tomovic, Andrija
    Janicic, Predrag
    [J]. AI*IA 2007: ARTIFICIAL INTELLIGENCE AND HUMAN-ORIENTED COMPUTING, 2007, 4733 : 410 - +
  • [33] Development of the N-gram Model for Azerbaijani Language
    Bannayeva, Aliya
    Aslanov, Mustafa
    [J]. 2020 IEEE 14TH INTERNATIONAL CONFERENCE ON APPLICATION OF INFORMATION AND COMMUNICATION TECHNOLOGIES (AICT2020), 2020,
  • [34] Discriminative N-gram Language Modeling for Turkish
    Arisoy, Ebru
    Roark, Brian
    Shafran, Izhak
    Saraclar, Murat
    [J]. INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, 2008, : 825 - +
  • [35] Improved Text Generation Using N-gram Statistics
    de Novais, Eder Miranda
    Tadeu, Thiago Dias
    Paraboni, Ivandre
    [J]. ADVANCES IN ARTIFICIAL INTELLIGENCE - IBERAMIA 2010, 2010, 6433 : 316 - 325
  • [36] Character n-gram application for automatic new topic identification
    Gencosman, Burcu Caglar
    Ozmutlu, Huseyin C.
    Ozmutlu, Seda
    [J]. INFORMATION PROCESSING & MANAGEMENT, 2014, 50 (06) : 821 - 856
  • [37] Detecting Spam Tweets using Character N-gram Features
    Ashour, Mokhtar
    Salama, Cherif
    El-Kharashi, M. Watheq
    [J]. PROCEEDINGS OF 2018 13TH INTERNATIONAL CONFERENCE ON COMPUTER ENGINEERING AND SYSTEMS (ICCES), 2018, : 190 - 195
  • [38] Profile based compression of n-gram language models
    Olsen, Jesper
    Oria, Daniela
    [J]. 2006 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING, VOLS 1-13, 2006, : 1041 - 1044
  • [39] Bayesian learning of n-gram statistical language modeling
    Bai, Shuanhu
    Li, Haizhou
    [J]. 2006 IEEE International Conference on Acoustics, Speech and Signal Processing, Vols 1-13, 2006, : 1045 - 1048
  • [40] n-BiLSTM: BiLSTM with n-gram Features for Text Classification
    Zhang, Yunxiang
    Rao, Zhuyi
    [J]. PROCEEDINGS OF 2020 IEEE 5TH INFORMATION TECHNOLOGY AND MECHATRONICS ENGINEERING CONFERENCE (ITOEC 2020), 2020, : 1056 - 1059