Character N-gram tokenization for European language text retrieval

Cited by: 151
Authors
McNamee, P. [1]
Mayfield, J. [1]
Affiliations
[1] Johns Hopkins Univ, Appl Phys Lab, Laurel, MD 20723 USA
Source
Information Retrieval | 2004 / Vol. 7 / Issue 1-2
Keywords
cross-language information retrieval; language-neutral retrieval; character n-grams; Cross Language Evaluation Forum; European languages
DOI
10.1023/B:INRT.0000009441.78971.be
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The Cross-Language Evaluation Forum has encouraged research in text retrieval methods for numerous European languages and has developed durable test suites that allow language-specific techniques to be investigated and compared. The labor associated with crafting a retrieval system that takes advantage of sophisticated linguistic methods is daunting. We examine whether language-neutral methods can achieve accuracy comparable to language-specific methods with less concomitant software complexity. Using the CLEF 2002 test set we demonstrate empirically how overlapping character n-gram tokenization can provide retrieval accuracy that rivals the best current language-specific approaches for European languages. We show that n=4 is a good choice for those languages, and document the increased storage and time requirements of the technique. We report on the benefits of and challenges posed by n-grams, and explain peculiarities attendant to bilingual retrieval. Our findings demonstrate clearly that accuracy using n-gram indexing rivals or exceeds accuracy using unnormalized words, for both monolingual and bilingual retrieval.
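As an illustration of the overlapping character n-gram tokenization the abstract describes, here is a minimal Python sketch. It is not the authors' implementation: the function name `char_ngrams`, the lowercasing, the whitespace collapsing, and the single-space padding at both ends are all assumptions made for this example, since the abstract does not specify how case or word boundaries are handled. The default `n=4` follows the abstract's recommendation for European languages.

```python
def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Return the overlapping character n-grams of `text` (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be a positive integer")

    # Assumption: fold case and collapse runs of whitespace to one space.
    normalized = " ".join(text.lower().split())
    if not normalized:
        return []

    # Assumption: pad with one space on each side so that n-grams can
    # mark the start and end of the text.
    padded = f" {normalized} "

    # Text shorter than n yields a single, shorter token.
    if len(padded) < n:
        return [padded]

    # Slide a window of width n one character at a time.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    print(char_ngrams("Juggling"))
```

Because the window advances by one character, a text of length L produces roughly L n-grams, which is consistent with the increased storage and time requirements the abstract mentions relative to word indexing.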
Pages: 73-97
Page count: 25