Character N-Gram Tokenization for European Language Text Retrieval

Authors
Paul McNamee
James Mayfield
Affiliation
[1] Johns Hopkins University, Applied Physics Laboratory
Source
Information Retrieval | 2004 / Vol. 7 (1-2)
Keywords
cross-language information retrieval; language-neutral retrieval; character n-grams; Cross Language Evaluation Forum; European languages
DOI
Not available
Abstract
The Cross-Language Evaluation Forum has encouraged research in text retrieval methods for numerous European languages and has developed durable test suites that allow language-specific techniques to be investigated and compared. The labor associated with crafting a retrieval system that takes advantage of sophisticated linguistic methods is daunting. We examine whether language-neutral methods can achieve accuracy comparable to language-specific methods with less concomitant software complexity. Using the CLEF 2002 test set we demonstrate empirically how overlapping character n-gram tokenization can provide retrieval accuracy that rivals the best current language-specific approaches for European languages. We show that n = 4 is a good choice for those languages, and document the increased storage and time requirements of the technique. We report on the benefits of and challenges posed by n-grams, and explain peculiarities attendant to bilingual retrieval. Our findings demonstrate clearly that accuracy using n-gram indexing rivals or exceeds accuracy using unnormalized words, for both monolingual and bilingual retrieval.
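To make the central idea concrete, the following is a minimal sketch of overlapping character n-gram tokenization with n = 4, the value the abstract recommends. The function name `char_ngrams`, the lowercasing, the whitespace collapsing, and the single-space padding at both ends are illustrative assumptions; the abstract does not state which normalization or boundary handling the authors used.

```python
def char_ngrams(text: str, n: int = 4) -> list[str]:
    """Return the overlapping character n-grams of `text`.

    Hypothetical helper, not the authors' implementation. N-grams are
    allowed to span word boundaries, so a space can appear inside a token.
    """
    # Assumed normalization: lowercase and collapse runs of whitespace.
    normalized = " ".join(text.lower().split())
    if not normalized:
        return []
    # Assumed padding: one space on each side, so that word-initial and
    # word-final character sequences produce their own n-grams.
    padded = f" {normalized} "
    if len(padded) < n:
        # Text shorter than n: keep it as a single token.
        return [padded]
    # Slide a window of width n one character at a time.
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


if __name__ == "__main__":
    for gram in char_ngrams("Text retrieval"):
        print(repr(gram))
```

Because the window advances one character at a time, a padded string of length m yields m - n + 1 tokens, far more than the number of words it contains. This is consistent with the increased storage and time requirements the abstract mentions.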
Pages: 73-97
Number of pages: 24