Corpus-based stemming using cooccurrence of word variants

Cited by: 131
Authors
Xu, JX [1]
Croft, WB [1]
Affiliations
[1] Univ Massachusetts, Dept Comp Sci, Amherst, MA 01003 USA
Keywords
algorithms; experimentation; performance; class refinement; cooccurrence; corpus analysis; information retrieval; n-gram; stemming
DOI
10.1145/267954.267957
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Stemming is used in many information retrieval (IR) systems to reduce variant word forms to common roots. It is one of the simplest applications of natural-language processing to IR and one of the most effective in terms of user acceptance and consistency, although it yields only small retrieval improvements. Current stemming techniques do not, however, reflect language use in specific corpora, and this can lead to occasional serious retrieval failures. We propose a technique that uses corpus-based word-variant cooccurrence statistics to modify or create a stemmer. Experimental results on English newspaper and legal text and on Spanish text demonstrate the viability of this technique and its advantages over conventional approaches that employ only morphological rules.
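The idea of refining a stemmer's equivalence classes with cooccurrence statistics can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `refine_classes`, the window size, the threshold, and the simple association score `n_ab / (n_a + n_b)` are all assumptions chosen for illustration, and the toy corpus is invented. Variants in an initial stem class stay together only if they cooccur often enough within a text window; the rest are split off.

```python
from collections import Counter, defaultdict
from itertools import combinations


def refine_classes(docs, classes, window=5, threshold=0.1):
    """Split each initial stem class into groups of variants that cooccur.

    docs:    list of token lists (the corpus).
    classes: list of sets of word variants (e.g. produced by a rule-based stemmer).
    The score n_ab / (n_a + n_b) is an illustrative assumption, not the
    statistic used in the paper.
    """
    member = {w: i for i, cls in enumerate(classes) for w in cls}
    freq = Counter()   # n_a: occurrences of each variant
    pair = Counter()   # n_ab: cooccurrences of two variants within the window

    for tokens in docs:
        for i, w in enumerate(tokens):
            if w not in member:
                continue
            freq[w] += 1
            for v in tokens[i + 1:i + 1 + window]:
                if v != w and member.get(v) == member[w]:
                    pair[frozenset((w, v))] += 1

    def find(parent, x):
        # Follow parent links to the representative of x's group.
        while parent[x] != x:
            x = parent[x]
        return x

    refined = []
    for cls in classes:
        parent = {w: w for w in cls}
        for a, b in combinations(sorted(cls), 2):
            denom = freq[a] + freq[b]
            if denom and pair[frozenset((a, b))] / denom >= threshold:
                parent[find(parent, a)] = find(parent, b)  # merge the two groups
        groups = defaultdict(set)
        for w in cls:
            groups[find(parent, w)].add(w)
        refined.extend(groups.values())
    return refined


# Toy corpus (invented for illustration).
docs = [
    ["the", "stock", "market", "and", "stocks", "fell"],
    ["stocking", "the", "shelf"],
    ["stock", "prices", "stocks", "rise"],
]
result = refine_classes(docs, [{"stock", "stocks", "stocking"}])
```

In this toy corpus, "stock" and "stocks" cooccur in two documents and remain in one class, while "stocking" never appears near them and is split into its own class. Linking variants through connected components is one simple grouping choice; stricter clustering could be substituted.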
Pages: 61-81
Page count: 21