Character N-gram tokenization for European language text retrieval

Cited by: 151

Authors:

McNamee, P [1]

Mayfield, J [1]

Affiliations:

[1] Johns Hopkins Univ, Appl Phys Lab, Laurel, MD 20723 USA

Source:

INFORMATION RETRIEVAL | 2004 / Vol. 7 / Issue 1-2

Keywords:

cross-language information retrieval; language-neutral retrieval; character n-grams; Cross Language Evaluation Forum; European languages;

DOI:

10.1023/B:INRT.0000009441.78971.be

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Subject classification code:

0812 ;

Abstract:

The Cross-Language Evaluation Forum has encouraged research in text retrieval methods for numerous European languages and has developed durable test suites that allow language-specific techniques to be investigated and compared. The labor associated with crafting a retrieval system that takes advantage of sophisticated linguistic methods is daunting. We examine whether language-neutral methods can achieve accuracy comparable to language-specific methods with less concomitant software complexity. Using the CLEF 2002 test set we demonstrate empirically how overlapping character n-gram tokenization can provide retrieval accuracy that rivals the best current language-specific approaches for European languages. We show that n=4 is a good choice for those languages, and document the increased storage and time requirements of the technique. We report on the benefits of and challenges posed by n-grams, and explain peculiarities attendant to bilingual retrieval. Our findings demonstrate clearly that accuracy using n-gram indexing rivals or exceeds accuracy using unnormalized words, for both monolingual and bilingual retrieval.
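To make the indexing technique named in the abstract concrete, here is a minimal sketch of overlapping character n-gram tokenization with n=4. It is an illustration only, not the authors' implementation: the function name `char_ngrams`, the lowercasing, the whitespace collapsing, and the choice to let n-grams span word boundaries are all assumptions made for this example.

```python
def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Return the overlapping character n-grams of `text`.

    Assumptions for this sketch (not taken from the paper): the text is
    lowercased, runs of whitespace collapse to a single space, and
    n-grams are allowed to span word boundaries.
    """
    normalized = " ".join(text.lower().split())
    if len(normalized) < n:
        # Input shorter than n: keep it as a single token, if non-empty.
        return [normalized] if normalized else []
    # Slide a window of width n one character at a time.
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


if __name__ == "__main__":
    for gram in char_ngrams("Text retrieval"):
        print(repr(gram))
```

Under these assumptions a string of L characters (with L at least n) yields L - n + 1 tokens, roughly one per character position rather than one per word, which is consistent with the increased storage and time requirements the abstract mentions.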


Pages: 73-97

Page count: 25
