Character N-Gram Tokenization for European Language Text Retrieval

Cited by: 0
Authors
Paul McNamee
James Mayfield
Affiliation
[1] Johns Hopkins University, Applied Physics Laboratory
Source
Information Retrieval | 2004, Volume 7
Keywords
cross-language information retrieval; language-neutral retrieval; character n-grams; Cross Language Evaluation Forum; European languages
DOI
Not available
Abstract
The Cross-Language Evaluation Forum has encouraged research in text retrieval methods for numerous European languages and has developed durable test suites that allow language-specific techniques to be investigated and compared. The labor associated with crafting a retrieval system that takes advantage of sophisticated linguistic methods is daunting. We examine whether language-neutral methods can achieve accuracy comparable to language-specific methods with less concomitant software complexity. Using the CLEF 2002 test set we demonstrate empirically how overlapping character n-gram tokenization can provide retrieval accuracy that rivals the best current language-specific approaches for European languages. We show that n = 4 is a good choice for those languages, and document the increased storage and time requirements of the technique. We report on the benefits of and challenges posed by n-grams, and explain peculiarities attendant to bilingual retrieval. Our findings demonstrate clearly that accuracy using n-gram indexing rivals or exceeds accuracy using unnormalized words, for both monolingual and bilingual retrieval.
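As a concrete illustration of the overlapping character n-gram tokenization the abstract describes, the sketch below emits all overlapping 4-grams of a text, using n = 4 as the paper recommends for European languages. The function name, the lower-casing step, and the use of '_' as a word-boundary marker are illustrative assumptions, not the authors' implementation.

# Minimal sketch of overlapping character n-gram tokenization (n = 4),
# the language-neutral indexing approach described in the abstract.
# Normalization details here are assumptions made for illustration.

def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Return all overlapping character n-grams of `text`."""
    # Lower-case and collapse whitespace so grams can span word
    # boundaries consistently; mark boundaries with '_'.
    normalized = "_".join(text.lower().split())
    padded = f"_{normalized}_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

if __name__ == "__main__":
    print(char_ngrams("text retrieval"))
    # ['_tex', 'text', 'ext_', 'xt_r', 't_re', '_ret', 'retr',
    #  'etri', 'trie', 'riev', 'ieva', 'eval', 'val_']

Indexing these grams in place of unnormalized words is what allows the approach to remain language-neutral, at the cost of the larger posting lists and slower query times the abstract notes.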
Pages: 73 - 97
Number of pages: 24