Corpus-based stemming using cooccurrence of word variants

Times cited: 131
Authors
Xu, JX [1]
Croft, WB [1]
Affiliation
[1] Univ Massachusetts, Dept Comp Sci, Amherst, MA 01003 USA
Keywords
algorithms; experimentation; performance; class refinement; cooccurrence; corpus analysis; information retrieval; n-gram; stemming
DOI
10.1145/267954.267957
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Stemming is used in many information retrieval (IR) systems to reduce variant word forms to common roots. It is one of the simplest applications of natural-language processing to IR and one of the most effective in terms of user acceptance and consistency, though it yields only small retrieval improvements. Current stemming techniques do not, however, reflect language use in specific corpora, and this can lead to occasional serious retrieval failures. We propose a technique for using corpus-based word variant cooccurrence statistics to modify or create a stemmer. Experimental results on English newspaper and legal text and on Spanish text demonstrate the viability of this technique and its advantages relative to conventional approaches that employ only morphological rules.
Pages: 61-81
Page count: 21
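
As a rough illustration of the idea summarized in the abstract (using cooccurrence of word variants in a corpus to modify a stemmer), the following minimal Python sketch splits an initial stem class into subclasses whose members actually occur near each other. It is not the authors' implementation. The function name `refine_classes`, the `window` and `threshold` parameters, the association score n_ab / (n_a + n_b), and the connected-components grouping are all assumptions chosen for illustration.

```python
from collections import Counter, defaultdict
from itertools import combinations


def refine_classes(docs, classes, window=5, threshold=0.1):
    """Split each initial stem class using window cooccurrence of its members.

    docs: list of token lists. classes: list of sets of word variants,
    e.g. as conflated by an existing stemmer. window (a span of tokens
    including the current one) and threshold are hypothetical parameters.
    """
    word_to_class = {w: i for i, cls in enumerate(classes) for w in cls}
    occurrences = Counter()  # n_a: occurrences of each variant
    pair_counts = Counter()  # n_ab: times two same-class variants share a window

    for tokens in docs:
        for i, a in enumerate(tokens):
            if a not in word_to_class:
                continue
            occurrences[a] += 1
            # Look only forward so each positional pair is counted once.
            for b in tokens[i + 1 : i + window]:
                if b != a and word_to_class.get(b) == word_to_class[a]:
                    pair_counts[frozenset((a, b))] += 1

    refined = []
    for cls in classes:
        # Union-find over the variants of this one class.
        parent = {w: w for w in cls}

        def find(w):
            while parent[w] != w:
                parent[w] = parent[parent[w]]
                w = parent[w]
            return w

        for a, b in combinations(sorted(cls), 2):
            n_ab = pair_counts[frozenset((a, b))]
            denom = occurrences[a] + occurrences[b]
            # Assumed association score: n_ab / (n_a + n_b).
            if denom and n_ab / denom >= threshold:
                parent[find(a)] = find(b)

        # Each connected component becomes one refined class.
        groups = defaultdict(set)
        for w in cls:
            groups[find(w)].add(w)
        refined.extend(groups.values())
    return refined
```

Given a class such as {"stock", "stocks", "stocking"}, a corpus in which "stock" and "stocks" appear close together while "stocking" appears elsewhere would leave "stocking" in a class of its own. The threshold would need tuning per corpus.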