Character N-Gram Tokenization for European Language Text Retrieval

Cited by: 0
Authors
Paul McNamee
James Mayfield
Affiliation
[1] Johns Hopkins University, Applied Physics Laboratory
Source
Information Retrieval | 2004, Volume 7
Keywords
cross-language information retrieval; language-neutral retrieval; character n-grams; Cross Language Evaluation Forum; European languages
DOI
Not available
Abstract
The Cross-Language Evaluation Forum has encouraged research in text retrieval methods for numerous European languages and has developed durable test suites that allow language-specific techniques to be investigated and compared. The labor associated with crafting a retrieval system that takes advantage of sophisticated linguistic methods is daunting. We examine whether language-neutral methods can achieve accuracy comparable to language-specific methods with less concomitant software complexity. Using the CLEF 2002 test set we demonstrate empirically how overlapping character n-gram tokenization can provide retrieval accuracy that rivals the best current language-specific approaches for European languages. We show that n = 4 is a good choice for those languages, and document the increased storage and time requirements of the technique. We report on the benefits of and challenges posed by n-grams, and explain peculiarities attendant to bilingual retrieval. Our findings demonstrate clearly that accuracy using n-gram indexing rivals or exceeds accuracy using unnormalized words, for both monolingual and bilingual retrieval.
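As a concrete illustration of the overlapping character n-gram tokenization the abstract describes, the sketch below emits all overlapping 4-grams of a text, using n = 4 as the paper recommends for European languages. The function name, the lower-casing step, and the use of '_' as a word-boundary marker are illustrative assumptions, not the authors' implementation.

# Minimal sketch of overlapping character n-gram tokenization (n = 4),
# the language-neutral indexing approach described in the abstract.
# Normalization details here are assumptions made for illustration.

def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Return all overlapping character n-grams of `text`."""
    # Lower-case and collapse whitespace so grams can span word
    # boundaries consistently; mark boundaries with '_'.
    normalized = "_".join(text.lower().split())
    padded = f"_{normalized}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

if __name__ == "__main__":
    print(char_ngrams("text retrieval"))
    # ['_tex', 'text', 'ext_', 'xt_r', 't_re', '_ret', 'retr',
    #  'etri', 'trie', 'riev', 'ieva', 'eval', 'val_']

Indexing these grams in place of unnormalized words is what allows the approach to remain language-neutral, at the cost of the larger posting lists and slower query times the abstract notes.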
Pages: 73 - 97
Number of pages: 24